Do LLM Agents Have Regret?
A Case Study in Online Learning and Games

Chanwoo Park, Xiangyu Liu, Asuman Ozdaglar, Kaiqing Zhang (C. Park and X. Liu contributed equally). C. Park and A. Ozdaglar are with the Massachusetts Institute of Technology, Cambridge, MA 02139. X. Liu and K. Zhang are with the University of Maryland, College Park, MD 20742. E-mails: {cpark97,asuman}@mit.edu, {xyliu999,kaiqing}@umd.edu.
Abstract

Large language models (LLMs) have been increasingly employed for (interactive) decision-making, via the development of LLM-based autonomous agents. Despite their emerging successes, the performance of LLM agents in decision-making has not been fully investigated through quantitative metrics, especially in the multi-agent setting when they interact with each other, a typical scenario in real-world LLM-agent applications. To better understand the limits of LLM agents in these interactive environments, we propose to study their interactions in benchmark decision-making settings in online learning and game theory, through the performance metric of regret. We first empirically study the no-regret behaviors of LLMs in canonical (non-stationary) online learning problems, as well as the emergence of equilibria when LLM agents interact through playing repeated games. We then provide some theoretical insights into the no-regret behaviors of LLM agents, under certain assumptions on the supervised pre-training and the rationality model of human decision-makers who generate the data. Notably, we also identify (simple) cases where advanced LLMs such as GPT-4 fail to be no-regret. To promote no-regret behaviors, we propose a novel unsupervised training loss, the regret-loss, which, in contrast to the supervised pre-training loss, does not require labels of (optimal) actions. We then establish a statistical guarantee of a generalization bound for regret-loss minimization, followed by an optimization guarantee that minimizing such a loss may automatically lead to known no-regret learning algorithms. Our further experiments demonstrate the effectiveness of our regret-loss, especially in addressing the above “regrettable” cases.

1 Introduction

Live Life with No Excuses. Travel with No Regret.

Large language models (LLMs) have recently exhibited remarkable reasoning capabilities (Bubeck et al., 2023; Achiam et al., 2023; Wei et al., 2022b; Yao et al., 2023a). As a consequence, a burgeoning body of work has been investigating the employment of LLMs as central controllers for (interactive) decision-making, through the construction of LLM-based autonomous agents (Hao et al., 2023; Shen et al., 2023; Yao et al., 2023b; Shinn et al., 2023; Wang et al., 2023c; Significant Gravitas, 2023). Specifically, the LLM agent interacts with the (physical) world in a dynamic/sequential way: it uses LLMs as an oracle for reasoning, then acts in the environment based on the reasoning and the feedback it perceives over time. LLM agents have achieved impressive successes in embodied AI (Ahn et al., 2022; Huang et al., 2022a; Wang et al., 2023a), natural science (Wu et al., 2023; Swan et al., 2023), and social science (Park et al., 2022, 2023) applications.

Besides being dynamic , another increasingly captivating feature of LLM-based decision-making is the involvement of strategic interactions, oftentimes among multiple LLM agents. For example, it has been continually reported that the reasoning capability of LLMs can be improved by interacting with each other through negotiation and/or debate games (Fu et al., 2023 ; Du et al., 2023 ) ; LLM agents have now been widely used to simulate the strategic behaviors for social and economic studies, to understand the emerging behaviors in interactive social systems (Aher et al., 2023 ; Park et al., 2023 ) . Moreover, LLMs have also exhibited remarkable potential in solving various games (Bakhtin et al., 2022 ; Mukobi et al., 2023 ) , and in fact, a rapidly expanding literature has employed repeated games as a fundamental benchmark to understand the strategic behaviors of LLMs (Brookins and DeBacker, 2023 ; Akata et al., 2023 ; Fan et al., 2023 ) . These exciting empirical successes call for a rigorous examination and understanding through a theoretical lens of decision-making.

Regret, on the other hand, has been a core metric in (online) decision-making. It measures how “sorry” the decision-maker is, in retrospect, not to have followed the best prediction in hindsight (Shalev-Shwartz, 2012). It provides not only a sensible way to evaluate the intelligence level of online decision-makers, but also a quantitative way to measure their robustness against arbitrary (and possibly adversarial) environments. More importantly, it inherently offers a connection to modeling and analyzing strategic behaviors: the long-run interaction of no-regret learners leads to certain equilibria when they repeatedly play games (Cesa-Bianchi and Lugosi, 2006). In fact, no-regret learning has served as a natural model for predicting and explaining human behaviors in strategic decision-making, with experimental evidence (Erev and Roth, 1998; Nekipelov et al., 2015; Balseiro and Gur, 2019). It has thus been posited as an important model of “rational behavior” in playing games (Blum et al., 2008; Roughgarden, 2015; Roughgarden et al., 2017). Thus, it is natural to ask:

Can we examine and better understand the online and strategic decision-making
behaviors of LLMs through the lens of regret ?

Acknowledging that LLM(-agents) are extremely complicated to analyze, to gain some insights into the question, we focus on benchmark decision-making settings: online learning with convex (linear) loss functions, and playing repeated games. We summarize our contributions as follows.

Contributions.

First, we carefully examine the performance of several representative pre-trained LLMs in the aforementioned benchmark online decision-making settings, in terms of regret. We observe that LLM agents oftentimes exhibit no-regret behaviors in these (non-stationary) online learning settings, where the loss functions change over time either arbitrarily (and even adversarially), or by following some patterns with bounded variation, as well as in playing both representative and randomly generated repeated games. For the latter, equilibria will emerge as the long-term behavior of the interactions when all LLM agents are no-regret. Second, we provide some theoretical insights into the observed no-regret behaviors, based on a hypothetical model of the human decision-makers who generate the data, and certain assumptions on the supervised pre-training procedure, a common practice in training large models for decision-making. In particular, we connect pre-trained LLMs to the known no-regret algorithm of follow-the-perturbed-leader (FTPL) under such assumptions. Third, we also identify (simple) cases where advanced LLMs such as GPT-4 fail to be no-regret. We thus propose a novel unsupervised training loss, the regret-loss, which, in contrast to the supervised pre-training loss, does not require labels of (optimal) actions. We then establish both statistical and optimization guarantees for regret-loss minimization, showing that minimizing such a loss may automatically lead to known no-regret learning algorithms. Our further experiments demonstrate the effectiveness of the regret-loss, especially in addressing the above “regrettable” cases.

1.1 Related Work

LLM(-agent) for decision-making.

The impressive capability of LLMs for reasoning (Bubeck et al., 2023; Achiam et al., 2023; Wei et al., 2022b, a; Srivastava et al., 2023; Yao et al., 2023a) has inspired a growing line of research on LLMs for (interactive) decision-making, i.e., an LLM-based autonomous agent interacting with the environment by taking actions repeatedly/sequentially, based on the feedback it perceives. Some promise has been shown from a planning perspective (Hao et al., 2023; Valmeekam et al., 2023; Huang et al., 2022b; Shen et al., 2023). In particular, for embodied AI applications, e.g., robotics, LLMs have achieved impressive performance when used as the controller for decision-making (Ahn et al., 2022; Yao et al., 2023b; Shinn et al., 2023; Wang et al., 2023c; Driess et al., 2023; Significant Gravitas, 2023). However, the performance of decision-making has not been rigorously characterized via the regret metric in these works. Very recently, Liu et al. (2023c) proposed a principled architecture for LLM agents, with provable regret guarantees in stationary and stochastic decision-making environments, under the Bayesian adaptive Markov decision process framework. In contrast, our work focuses on online learning and game-theoretic settings, in potentially adversarial and non-stationary environments. Moreover, the first part of our work focuses on evaluating the intelligence level of LLMs per se in decision-making (in terms of the regret metric), while Liu et al. (2023c) focused on developing a new architecture that uses an LLM as an oracle for reasoning, together with memory and specific planning/acting subroutines, to achieve sublinear (Bayesian) regret in stationary and stochastic environments.

LLMs in multi-agent environments.

The interaction of multiple LLM agents has garnered significant attention lately. For example, Fu et al. (2023) showed that LLMs can autonomously improve each other in a negotiation game by playing and criticizing each other. Similarly, Du et al. (2023); Liang et al. (2023); Xiong et al. (2023); Chan et al. (2024); Li et al. (2023c) showed that multi-LLM debate can improve the reasoning and evaluation capabilities of LLMs. Qian et al. (2023); Schick et al. (2023); Wu et al. (2023) demonstrated the potential of multi-LLM interactions and collaboration in software development, writing, and problem-solving, respectively. Zhang et al. (2024) exhibited a similar potential in embodied cooperative environments. More formally, multi-LLM interactions have also been investigated under a game-theoretic framework, to characterize the strategic decision-making of LLM agents. Bakhtin et al. (2022); Mukobi et al. (2023) and Xu et al. (2023b, a) have demonstrated the promise of LLMs in playing Diplomacy and Werewolf games, respectively, which are both language-based games with a mixture of competitive and cooperative agents. Note that these works utilized LLMs to solve specific, rather than general, games. Related to our work, Brookins and DeBacker (2023); Akata et al. (2023); Lorè and Heydari (2023); Fan et al. (2023) have also used (repeated) matrix games as a benchmark to evaluate the reasoning capability and rationality of LLM agents. In contrast to our work, these empirical studies have not formally investigated LLM agents using the metric of regret, nor through the lenses of online learning and equilibrium computation, which are all fundamental in modeling and analyzing strategic multi-agent interactions. Moreover, our work also provides theoretical results to explain and further enhance the no-regret property of LLM agents.

LLMs & Human/Social behavior.

LLMs have also been used to simulate the behavior of human beings, for social science and economics studies (Engel et al., 2023). The extent to which LLMs can simulate human behavior has been claimed as a way to evaluate their level of intelligence in a controlled environment (Aher et al., 2023; Tsai et al., 2023). For example, Li et al. (2023b); Hong et al. (2024); Zhao et al. (2023) showed that by assigning different “roles” to LLM agents, certain collaborative/competitive behaviors can emerge. Argyle et al. (2023) showed that LLMs can emulate response distributions from diverse human subgroups, illustrating their adaptability. Horton (2023) argued that an LLM, as a computational model of humans, can be used as homo economicus when given endowments, information, preferences, etc., to gain new economic insights by simulating its interactions with other LLMs. Park et al. (2022, 2023) proposed scalable simulators that can generate realistic social behaviors emerging in populated and interactive social systems, and such emerging behaviors of LLM agents in society have also been consistently observed in Chen et al. (2024, 2023). Li et al. (2023d, a) studied the opinion/behavioral dynamics of LLM agents on social networks. These empirical results have inspired our work, which can be viewed as an initial attempt toward quantitatively understanding the emerging behavior of LLMs as computational human models, given the well-known justification of equilibrium being a long-run emerging behavior of learning dynamics (Fudenberg and Levine, 1998) and strategic interactions (Young, 2004; Camerer, 2011).

Transformers & In-context-learning.

LLMs nowadays are predominantly built upon the architecture of Transformers (Vaswani et al., 2017). Transformers have exhibited a remarkable capacity for in-context learning (ICL): they can construct new predictors from sequences of labeled examples given as input, without further parameter updates. This has enabled the few-shot learning capability of Transformers (Brown et al., 2020; Garg et al., 2022; Min et al., 2022). The empirical successes have inspired burgeoning theoretical studies on ICL. Xie et al. (2022) used a Bayesian inference framework to explain how ICL works, which has also been adopted in Wang et al. (2023b); Jiang (2023). Akyürek et al. (2023); Von Oswald et al. (2023); Dai et al. (2023); Giannou et al. (2023) showed (among other results) that ICL arises from the fact that Transformers can implement the gradient descent (GD) algorithm. Bai et al. (2023) further established that Transformers can implement a broad class of machine learning algorithms in context. Moreover, Ahn et al. (2023); Zhang et al. (2023a); Mahankali et al. (2023) proved that a minimizer of a certain training loss among single-layer Transformers is equivalent to a single step of GD for linear regression. Li et al. (2023e) established generalization bounds of ICL from a multi-task learning perspective. Zhang et al. (2023b) argued that ICL implicitly implements Bayesian model averaging, and can be approximated by the attention mechanism. They also established a result on some regret metric. However, that regret notion is not defined for (online) decision-making, and is fundamentally different from ours, which is standard in online learning and games. Also, we provide extensive experiments to validate the no-regret behavior by our definition. More recently, the ICL property has also been generalized to decision-making settings. Laskin et al. (2023); Lee et al. (2023); Lin et al. (2024) investigated the in-context reinforcement learning (RL) property of Transformers under supervised pre-training, for solving stochastic bandits and Markov decision processes. In contrast, our work focuses on online learning settings with an arbitrary and potentially adversarial nature, as well as game-theoretic settings. We also provide a new unsupervised loss to promote the no-regret behavior in our settings.

Online learning and games.

Online learning has been extensively studied to model the decision-making of an agent who interacts with the environment sequentially, with a potentially arbitrary sequence of loss functions (Shalev-Shwartz, 2012; Hazan, 2016), and has a deep connection to game theory (Cesa-Bianchi and Lugosi, 2006). In particular, regret, the difference between the incurred accumulated loss and the best-in-hindsight accumulated loss, has been the core performance metric, and a good online learning algorithm should have regret at most sublinear in time $T$ (i.e., of order $o(T)$), which is referred to as being no-regret. Many well-known algorithms can achieve no-regret against arbitrary loss sequences, e.g., multiplicative weight updates (MWU)/Hedge (Freund and Schapire, 1997; Arora et al., 2012), EXP3 (Auer et al., 2002), and more generally follow-the-regularized-leader (FTRL) (Shalev-Shwartz and Singer, 2007) and follow-the-perturbed-leader (FTPL) (Kalai and Vempala, 2005). In the bandit literature (Lattimore and Szepesvári, 2020; Bubeck et al., 2012), such a setting without any statistical assumptions on the losses is also referred to as the adversarial/non-stochastic setting. Following the conventions in this literature, the online settings we focus on shall not be confused with the stationary and stochastic (bandit/reinforcement learning) settings that have been explored in several other recent works on Transformers for decision-making (Lee et al., 2023; Lin et al., 2024). Centering around the regret metric, our work has also explored the non-stationary bandit setting (Besbes et al., 2014), as well as the repeated game setting where the environment itself consists of strategic agents (Cesa-Bianchi and Lugosi, 2006).

2 Preliminaries

Notation.

We use $\mathbb{N}$ and $\mathbb{N}^{+}$ to denote the sets of non-negative and positive integers, respectively. For a finite set $\mathcal{S}$, we use $\Delta(\mathcal{S})$ to denote the simplex over $\mathcal{S}$. For $d\in\mathbb{N}^{+}$, we define $[d]:=\{1,2,\dots,d\}$. For two vectors $x,y\in\mathbb{R}^{d}$, we use $\langle x,y\rangle$ to denote their inner product. We define $\boldsymbol{0}_{d}$ and $\boldsymbol{1}_{d}$ as the $d$-dimensional zero and one vectors, and $\boldsymbol{O}_{d\times d}$ and $I_{d\times d}$ as the $d\times d$ zero and identity matrices, respectively. We omit $d$ when it is clear from the context. We define $e_{i}$ as the unit vector (of proper dimension) whose $i$-th coordinate equals $1$. For $p\in\mathbb{R}^{d}$, $R>0$, and a convex set $C\subseteq\mathbb{R}^{d}$, define $B(p,R,\|\cdot\|):=\{x\in\mathbb{R}^{d}\mid\|x-p\|\leq R\}$, $\texttt{Proj}_{C,\|\cdot\|}(p):=\operatorname{arg\,min}_{x\in C}\|x-p\|$ (which is well defined since $C$ is convex), and $\texttt{clip}_{R}(x):=[\texttt{Proj}_{B(0,R,\|\cdot\|_{2}),\|\cdot\|_{2}}(x_{i})]_{i\in[d]}$. Define $\texttt{Softmax}(x):=\big(\frac{e^{x_{i}}}{\sum_{j\in[d]}e^{x_{j}}}\big)_{i\in[d]}$ and $\texttt{ReLU}(x)=\max(0,x)$ for $x\in\mathbb{R}^{d}$.
For $A\in\mathbb{R}^{m\times n}$ with $A_{i}$ denoting its $i$-th column, we define $\|A\|_{\texttt{op}}:=\max_{\|x\|_{2}\leq 1}\|Ax\|_{2}$, $\|A\|_{2,\infty}:=\sup_{i\in[n]}\|A_{i}\|_{2}$, $\|A\|_{F}$ as the Frobenius norm, and $A_{-1}:=A_{n}$ to denote the last column of $A$. We define $\mathbb{R}^{+}:=\{x\mid x\geq 0\}$. For a set $\Pi$, define $\text{diam}(\Pi,\|\cdot\|):=\sup_{\pi_{1},\pi_{2}\in\Pi}\|\pi_{1}-\pi_{2}\|$. We define $\mathbbm{1}(\mathcal{E}):=1$ if $\mathcal{E}$ is true, and $\mathbbm{1}(\mathcal{E}):=0$ otherwise. For a random-variable sequence $(X_{n})_{n\in\mathbb{N}}$ and random variables $X,Y$, we denote by $F_{X}$ the cumulative distribution function of $X$; $X_{n}\overset{p}{\to}X$ if $\forall\epsilon>0,\ \lim_{n\to\infty}\mathbb{P}(|X_{n}-X|>\epsilon)=0$; $X_{n}\overset{d}{\to}X$ if $\lim_{n\to\infty}F_{X_{n}}(x)=F_{X}(x)$ for all $x$ at which $F_{X}$ is continuous; $X\overset{d}{=}Y$ if $F_{X}(x)=F_{Y}(x)$ for all $x$; $X_{n}\overset{a.s.}{\to}X$ if $\mathbb{P}(\lim_{n\to\infty}X_{n}=X)=1$; and $\operatorname{esssup}(X):=\inf\{M\in\mathbb{R}:\mathbb{P}(X>M)=0\}$. For a random variable $X$, we use $\text{supp}(X)$ to denote its support. For functions $f,g:\mathbb{R}\to\mathbb{R}$, we write $g(x)=\mathcal{O}(f(x))$ if there exist $x_{0},M<\infty$ such that $|g(x)|\leq M|f(x)|$ for all $x>x_{0}$. We use $f^{\prime}$ to denote the derivative of $f$. Let $F:\Omega\rightarrow\mathbb{R}$ be a continuously differentiable, strictly convex function defined on a convex set $\Omega$. The Bregman divergence associated with $F$ for points $p,q$ is defined as $D_{F}(p,q):=F(p)-F(q)-\langle\nabla F(q),p-q\rangle$. For a sequence $(\ell_{t})_{t\in[T]}$ for some $T\in\mathbb{N}^{+}$, we define $\ell_{a:b}:=(\ell_{a},\cdots,\ell_{b})$ for $1\leq a\leq b\leq T$, and $\ell_{a:b}:=\emptyset$ if $a>b$.

2.1 Online Learning & Games

Online learning.

We first consider the online learning setting where an agent interacts with the environment for $T$ rounds, by iteratively making decisions based on the feedback she receives. Specifically, at each time step $t$, the agent chooses her decision policy $\pi_{t}\in\Pi$ for some bounded domain $\Pi$, and after her commitment to $\pi_{t}$, a bounded loss function $f_{t}:\Pi\to[-B,B]$ for some constant $B>0$ is chosen by the environment, potentially in an adversarial fashion. The agent thus incurs a loss of $f_{t}(\pi_{t})$, and will update her decision to $\pi_{t+1}$ using the feedback. We focus on the most basic setting where the agent chooses actions from a finite set $\mathcal{A}$ every round, which is also referred to as the Experts Problem (Littlestone and Warmuth, 1994; Hazan, 2016), without much loss of generality (c.f. Section A.4 for a discussion). In this case, $\Pi$ becomes the simplex over $\mathcal{A}$, i.e., $\Pi=\Delta(\mathcal{A})$, and $f_{t}(\pi_{t})=\langle\ell_{t},\pi_{t}\rangle$ for some loss vector $\ell_{t}\in\mathbb{R}^{d}$ that may change over time, where $d:=|\mathcal{A}|$.

At time step $t\in[T]$, the agent may receive either the full vector $\ell_{t}$, or only the realized loss $\ell_{t,a_{t}}$ (we sometimes also interchangeably write it as $\ell_{t}(a_{t})$), the $a_{t}$-th element of $\ell_{t}$, for some $a_{t}\sim\pi_{t}(\cdot)$, as feedback; these settings will be referred to as online learning with full-information feedback and with bandit feedback, respectively. The latter is also referred to as the adversarial/non-stochastic bandit problem in the multi-armed bandit (MAB) literature. Note that hereafter, we will by default refer to this setting, which does not make any assumptions on the loss sequence $(\ell_{t})_{t\in[T]}$, simply as online learning. Moreover, if the loss functions change over time (usually with certain bounded variation), we will refer to it as non-stationary online learning for short, whose bandit-feedback version is also referred to as the non-stationary bandit problem.
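As a minimal illustration of this interaction protocol (a sketch we add for concreteness; the function names and the Hedge learner below are our own, not from the paper), the Experts Problem with either feedback model can be simulated as:

```python
import numpy as np

def experts_protocol(loss_vectors, learner, bandit=False, seed=0):
    """Run the Experts Problem for T rounds.

    loss_vectors: (T, d) array of loss vectors chosen by the environment.
    learner: callable mapping the observed history to a policy in Delta(A).
    bandit: if True, the learner observes only the realized loss of its action.
    """
    rng = np.random.default_rng(seed)
    T, d = loss_vectors.shape
    history, policies = [], []
    for t in range(T):
        pi_t = learner(history)       # commit to a policy pi_t in the simplex
        a_t = rng.choice(d, p=pi_t)   # sample an action a_t ~ pi_t
        if bandit:
            history.append((a_t, loss_vectors[t, a_t]))  # realized loss only
        else:
            history.append(loss_vectors[t])              # full loss vector
        policies.append(pi_t)
    return np.array(policies)

# A simple full-information learner: Hedge / multiplicative weights.
def hedge(history, eta=0.5):
    if not history:
        return np.ones(2) / 2          # uniform start (d = 2 in this example)
    cum_loss = np.sum(history, axis=0)
    w = np.exp(-eta * cum_loss)
    return w / w.sum()
```

For instance, after one round of loss on the first action, the Hedge learner shifts probability mass toward the second action.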

Repeated games.

The online learning setting above has an intimate connection to game theory. Consider a normal-form game $\mathcal{G}=\langle N,\{\mathcal{A}_{n}\}_{n\in[N]},\{r_{n}\}_{n\in[N]}\rangle$, where $N$ is the number of players, and $\mathcal{A}_{n}$ and $r_{n}:\mathcal{A}_{1}\times\dots\times\mathcal{A}_{N}\rightarrow[-B,B]$ are the action set and the payoff function of player $n$, respectively. The $N$ players repeatedly play the game for $T$ rounds; each player $n$ maintains a strategy $\pi_{n,t}\in\Delta(\mathcal{A}_{n})$ at time $t$, and takes action $a_{n,t}\sim\pi_{n,t}(\cdot)$. The joint action $a_{t}=(a_{1,t},\cdots,a_{N,t})$ determines the payoff of each player at time $t$, $\{r_{n}(a_{t})\}_{n\in[N]}$. From a single player's (e.g., player $n$'s) perspective, she encounters an online learning problem with (expected) loss function $\ell_{t}:=-\mathbb{E}_{a_{-n,t}\sim\pi_{-n,t}}[r_{n}(\cdot,a_{-n,t})]$ at time $t$, where $-n$ denotes the indices of all players other than player $n$. We will refer to this as the game setting for short, and use the terms “agent” and “player” interchangeably hereafter.
The key difference between online learning and repeated games is in their interaction dynamics: online learning involves an agent facing a potentially adversarial, changing environment (or sequence of loss functions), while in repeated games, agents interact by playing the same game repeatedly, which might be less adversarial when they follow specific learning algorithms.
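To make the reduction from repeated games to online learning concrete, here is a small sketch (our own illustration; the Matching Pennies payoff matrix is just an example) of the induced loss vector $\ell_{t}$ that player 1 faces given the opponent's mixed strategy:

```python
import numpy as np

# Payoff matrix for player 1 in a two-player normal-form game:
# R[i, j] = r_1(a_1 = i, a_2 = j). Here: Matching Pennies.
R = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

def induced_loss(R, pi_opponent):
    """The loss vector player 1 faces in online learning terms:
    ell_t(i) = -E_{a_2 ~ pi_opponent}[r_1(i, a_2)] = -(R @ pi_opponent)[i]."""
    return -R @ pi_opponent

pi_2 = np.array([0.75, 0.25])   # opponent's current mixed strategy
ell = induced_loss(R, pi_2)
# Expected payoffs are R @ pi_2 = [0.5, -0.5], so the loss vector is [-0.5, 0.5].
```

As the opponent's strategy $\pi_{-n,t}$ changes over rounds, the induced loss sequence changes with it, which is exactly why the game setting is an instance of (possibly non-stationary) online learning.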

2.2 Performance Metric: Regret

We now introduce regret, the core performance metric used in online learning and games. For a given algorithm $\mathscr{A}$, let $\pi_{\mathscr{A},t}$ denote the decision policy of the agent at time $t$ generated by $\mathscr{A}$. Then the regret, the difference between the accumulated (expected) loss incurred by implementing $\mathscr{A}$ and that incurred by the best-in-hindsight fixed decision, is defined as

$$\text{Regret}_{\mathscr{A}}\big((f_{t})_{t\in[T]}\big):=\sum_{t=1}^{T}f_{t}(\pi_{\mathscr{A},t})-\inf_{\pi\in\Pi}\sum_{t=1}^{T}f_{t}(\pi).$$

In the Experts Problem, the definition can be instantiated as $\text{Regret}_{\mathscr{A}}((\ell_{t})_{t\in[T]}):=\sum_{t=1}^{T}\langle\ell_{t},\pi_{\mathscr{A},t}\rangle-\inf_{\pi\in\Pi}\sum_{t=1}^{T}\langle\ell_{t},\pi\rangle$. With bandit feedback, a commonly used metric may further take the expectation of $\text{Regret}_{\mathscr{A}}$ over the randomness of the generated policies $(\pi_{\mathscr{A},t})_{t\in[T]}$. An algorithm $\mathscr{A}$ is referred to as being no-regret if $\text{Regret}_{\mathscr{A}}((f_{t})_{t\in[T]})\sim o(T)$, i.e., the regret grows sublinearly in $T$. Widely known no-regret algorithms include follow-the-regularized-leader (FTRL) (Shalev-Shwartz and Singer, 2007) and follow-the-perturbed-leader (FTPL) (Kalai and Vempala, 2005) (see Section A.3 for a detailed introduction).
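As a hedged illustration (our own code, not the paper's; Hedge/MWU stands in for the named no-regret algorithms), the regret of the Experts Problem and its sublinear growth under Hedge can be checked numerically:

```python
import numpy as np

def regret(losses, policies):
    """Regret for the Experts Problem: incurred expected loss minus the
    loss of the best fixed action in hindsight.
    losses: (T, d) loss vectors; policies: (T, d) rows in the simplex."""
    incurred = np.einsum('td,td->t', losses, policies).sum()
    # The infimum over the simplex of a linear function is at a vertex e_i,
    # so the best fixed comparator is the best single action.
    best_fixed = losses.sum(axis=0).min()
    return incurred - best_fixed

def hedge_policies(losses, eta):
    """Hedge (MWU): pi_t proportional to exp(-eta * cumulative loss)."""
    T, d = losses.shape
    cum = np.zeros(d)
    pis = []
    for t in range(T):
        w = np.exp(-eta * (cum - cum.min()))  # shift for numerical stability
        pis.append(w / w.sum())
        cum += losses[t]
    return np.array(pis)

rng = np.random.default_rng(0)
T, d = 2000, 5
losses = rng.uniform(0, 1, size=(T, d))
eta = np.sqrt(np.log(d) / T)
reg = regret(losses, hedge_policies(losses, eta))
# Hedge with this eta guarantees regret at most ~2*sqrt(T log d), sublinear in T.
```

With losses in $[0,1]$ and $\eta=\sqrt{\log d/T}$, the classical Hedge bound gives regret at most on the order of $\sqrt{T\log d}$, which the run above stays well within.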

In non-stationary online learning, one also uses the metric of dynamic regret (Zinkevich, 2003), where the comparator in the definition also changes over time, as the best decision policy at each individual time $t$: $\text{D-Regret}_{\mathscr{A}}((f_{t})_{t\in[T]}):=\sum_{t=1}^{T}f_{t}(\pi_{\mathscr{A},t})-\sum_{t=1}^{T}\inf_{\pi\in\Pi}f_{t}(\pi)$. This is a stronger notion than $\text{Regret}_{\mathscr{A}}((f_{t})_{t\in[T]})$ in that $\text{Regret}_{\mathscr{A}}((f_{t})_{t\in[T]})\leq\text{D-Regret}_{\mathscr{A}}((f_{t})_{t\in[T]})$.
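The two notions can be contrasted numerically (an illustrative sketch; the function names are ours): the dynamic comparator picks the best action per round rather than a single fixed action, so dynamic regret always dominates static regret.

```python
import numpy as np

def static_regret(losses, policies):
    """Static regret: compare against the best fixed action in hindsight."""
    incurred = np.einsum('td,td->t', losses, policies).sum()
    return incurred - losses.sum(axis=0).min()

def dynamic_regret(losses, policies):
    """Dynamic regret: compare against the best action at each round t."""
    incurred = np.einsum('td,td->t', losses, policies).sum()
    return incurred - losses.min(axis=1).sum()

rng = np.random.default_rng(1)
losses = rng.uniform(0, 1, size=(100, 3))
policies = np.full((100, 3), 1 / 3)   # a naive uniform policy, for illustration
# Summing per-round minima is at most the minimum of the column sums,
# so dynamic_regret >= static_regret holds for any policy sequence.
```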

3 Do Pre-Trained LLMs Have Regret? Experimental Validation

In this section, we explore the no-regret behaviors of representative pre-trained LLMs (i.e., GPT-4 Turbo, GPT-4, and GPT-3.5 Turbo) in the context of online learning and games. All experiments with LLMs are conducted using the public OpenAI Python API (OpenAI, 2023).

Intuition why pre-trained language models may exhibit no-regret behavior.

Transformer-based LLMs have demonstrated impressive in-context learning and few-/zero-shot learning capabilities (Brown et al., 2020; Garg et al., 2022; Min et al., 2022). One theoretical explanation is that trained Transformers can implement the gradient descent algorithm on the testing loss in certain supervised learning problems (Akyürek et al., 2023; Von Oswald et al., 2023; Dai et al., 2023; Ahn et al., 2023; Zhang et al., 2023a; Mahankali et al., 2023), which is inherently adaptive to the loss function used at test time. On the other hand, it is known in online learning that the simple algorithm of online gradient descent (Zinkevich, 2003) achieves no-regret. Hence, it seems reasonable to envision no-regret behavior from such meta-learners in online learning, due to their fast adaptability. However, this implication is not straightforward, due to the fundamental difference between multi-task/meta-learning and online learning settings, as well as the difference between stationary and non-stationary/adversarial environments in decision-making. Next, we provide both experimental and theoretical studies to validate this intuition.
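For concreteness, online gradient descent (Zinkevich, 2003) for linear losses over the simplex can be sketched as follows (our own minimal implementation, not the paper's; the simplex projection follows the standard sort-based algorithm):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    (standard sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    # Largest index rho with u_rho * (rho+1) > css_rho - 1 (0-indexed).
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0)

def online_gradient_descent(losses, eta):
    """OGD for linear losses f_t(pi) = <ell_t, pi>:
    pi_{t+1} = Proj_simplex(pi_t - eta * ell_t)."""
    T, d = losses.shape
    pi = np.ones(d) / d
    pis = []
    for t in range(T):
        pis.append(pi.copy())
        pi = project_simplex(pi - eta * losses[t])
    return np.array(pis)

# If action 0 keeps incurring loss, OGD shifts mass toward action 1.
losses = np.tile(np.array([[1.0, 0.0]]), (10, 1))
pis = online_gradient_descent(losses, eta=0.1)
```

With step size $\eta\sim 1/\sqrt{T}$, this procedure attains $\mathcal{O}(\sqrt{T})$ regret for bounded convex losses, the no-regret guarantee referenced above.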

Interaction protocol.

To enable sequential interactions with LLMs, we first describe the setup and objective of our experimental study. At each round, we incorporate the entire history of loss vectors from past interactions into our prompts, as concatenated text, and ask the LLM agent to determine a policy that guides the decision-making for the next round. Note that since we hope to evaluate the intelligence level of pre-trained LLMs through online learning or games, we only provide simple prompts stating that she should utilize the history information, without providing explicit rules for how to make use of it, nor asking her to minimize regret (in any sense). A detailed description and an ablation study of the prompts are deferred to Section B.1, and an illustration of the protocol in playing repeated games is given in Figure 3.
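The protocol can be sketched as follows, with `query_llm` as a stand-in for the actual API call (the OpenAI chat endpoint in our experiments), and the prompt text as a simplified placeholder for the actual prompts deferred to Section B.1:

```python
import numpy as np

def play_online_round(history, num_actions, query_llm):
    """One round of the interaction protocol: the full loss history is
    serialized into the prompt, and the agent returns a policy for the next
    round. `query_llm` is any function mapping a prompt string to a list of
    `num_actions` nonnegative numbers (a stand-in for the real API call)."""
    prompt = (
        f"You are playing a repeated game with {num_actions} actions. "
        f"Loss vectors you received so far: {history}. "
        "Use this history to report a probability distribution over actions "
        "for the next round."
    )
    policy = np.asarray(query_llm(prompt), dtype=float)
    return policy / policy.sum()  # normalize defensively

# A trivial stand-in agent (uniform play), just to exercise the loop.
def uniform_agent(prompt):
    return [1.0, 1.0]

history = []
for loss in ([1.0, 0.0], [0.0, 1.0]):
    policy = play_online_round(history, 2, uniform_agent)
    history.append(loss)
```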

3.1 Framework for No-Regret Behavior Validation

Before delving into the results, we note that, to the best of our knowledge, there is no principled framework for validating no-regret behaviors with finite-time experimental data. Therefore, we propose two frameworks to rigorously validate the no-regret behavior of algorithms over a finite $T$, which might be of independent interest.

Trend-checking framework.

We propose the following hypothesis test:

$H_0$: the sequence $\big(\text{Regret}_{\mathscr{A}}((f_\tau)_{\tau\in[t]})/t\big)_{t=1}^{\infty}$ either diverges or converges to a positive constant;
$H_1$: the sequence $\big(\text{Regret}_{\mathscr{A}}((f_\tau)_{\tau\in[t]})/t\big)_{t=1}^{\infty}$ converges to $0$,

with $H_0$ and $H_1$ denoting the null and alternative hypotheses, respectively. The notion of convergence is defined w.r.t. $T\to\infty$, making it challenging to verify directly. As an alternative, we propose a more tractable, albeit weaker, hypothesis test that still captures the essence of our objective:

$H_0$: the sequence $\big(\text{Regret}_{\mathscr{A}}((f_\tau)_{\tau\in[t]})/t\big)_{t\in[T]}$ does not exhibit a decreasing trend;
$H_1$: the sequence $\big(\text{Regret}_{\mathscr{A}}((f_\tau)_{\tau\in[t]})/t\big)_{t\in[T]}$ exhibits a decreasing trend.

Ideally, one should check whether $\text{Regret}_{\mathscr{A}}((f_\tau)_{\tau\in[t]})/t$ approaches zero as $t$ goes to infinity. With a finite $T$, testing these hypotheses provides a way to quantify this: whether we can reject $H_0$ offers a measure of it. To this end, one counts the number of times $\text{Regret}_{\mathscr{A}}((f_\tau)_{\tau\in[t]})/t-\text{Regret}_{\mathscr{A}}((f_\tau)_{\tau\in[t+1]})/(t+1)>0$ occurs, and uses Proposition 1 below (whose proof is deferred to Section B.2) to bound (how small) the probability of this count is under the null. For example, with the default choice of $T=25$ in our experiments later, one can see from Proposition 1 that $\mathbb{P}_{H_0}(\mathcal{E}(17,25))<0.032$, $\mathbb{P}_{H_0}(\mathcal{E}(19,25))<0.0035$, $\mathbb{P}_{H_0}(\mathcal{E}(21,25))<0.00014$, i.e., one can reject $H_0$ with high confidence. We will report the $p$-value of $H_0$, denoted as $p_{trend}$, as the output of this framework.

Proposition 1 ($p$-value of the null hypothesis).

Define the event

$$\mathcal{E}(s,T):=\bigg\{\text{the number of }t\in[T-1]\text{ such that }\frac{\text{Regret}_{\mathscr{A}}\big((f_\tau)_{\tau\in[t]}\big)}{t}-\frac{\text{Regret}_{\mathscr{A}}\big((f_\tau)_{\tau\in[t+1]}\big)}{t+1}>0\text{ is at least }s\bigg\},\qquad s\geq\frac{T-1}{2}.$$

Under the null hypothesis $H_0$, the probability of this event is bounded as $\mathbb{P}_{H_0}(\mathcal{E}(s,T))\leq\frac{1}{2^{T-1}}\sum_{t=s}^{T-1}\binom{T-1}{t}$.
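The bound in Proposition 1 is straightforward to evaluate numerically; a minimal sketch (the function name is ours) reproducing the numbers quoted above for $T=25$:

```python
from math import comb

def trend_p_value_bound(s, T):
    """Upper bound on P_{H0}(E(s, T)) from Proposition 1: under the null,
    each of the T-1 consecutive differences of Regret_t / t decreases with
    probability at most 1/2, so seeing at least s decreases out of T-1 has
    probability at most 2^{-(T-1)} * sum_{t=s}^{T-1} C(T-1, t)."""
    return sum(comb(T - 1, t) for t in range(s, T)) / 2 ** (T - 1)
```

For instance, `trend_p_value_bound(17, 25)` evaluates to about 0.0320, matching $\mathbb{P}_{H_0}(\mathcal{E}(17,25))<0.032$ in the text.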

Regression-based framework.

As a complement to the statistical framework above, we propose an alternative approach that fits the data with regression. In particular, one can use the data

$$\Big\{\big(t,\log\text{Regret}_{\mathscr{A}}\big((f_\tau)_{\tau\in[t]}\big)\big)\Big\}_{t\in[T]}$$

to fit a linear function $g(t)=\beta_0\log t+\beta_1$; an estimate $\widehat{\beta}_0$ of $\beta_0$ satisfying $\widehat{\beta}_0<1$ may be used to indicate the no-regret behavior, i.e., the sublinear growth of $\text{Regret}_{\mathscr{A}}\big((f_\tau)_{\tau\in[t]}\big)$ over time. While simple, this approach cannot be directly used when $\text{Regret}_{\mathscr{A}}\big((f_\tau)_{\tau\in[t]}\big)<0$, so we set $\log\text{Regret}_{\mathscr{A}}\big((f_\tau)_{\tau\in[t]}\big)$ to $-10$ in that case. We define $p_{reg}$ as the $p$-value of the regression parameter $\widehat{\beta}_0$, and will report the pair $(\widehat{\beta}_0,p_{reg})$ as the output of this framework.
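A minimal sketch of this regression step (the function name is ours; a full statistical package would additionally supply the $p$-value $p_{reg}$, which a bare least-squares fit does not):

```python
import numpy as np

def fit_regret_growth(regrets, floor=-10.0):
    """Least-squares fit of log(Regret_t) ~ beta0 * log(t) + beta1; an
    estimate beta0 < 1 indicates sublinear regret growth. Non-positive
    regrets are mapped to `floor` before taking logs, as in the text."""
    regrets = np.asarray(regrets, dtype=float)
    t = np.arange(1, len(regrets) + 1, dtype=float)
    y = np.full(len(regrets), floor)
    mask = regrets > 0
    y[mask] = np.log(regrets[mask])
    X = np.column_stack([np.log(t), np.ones_like(t)])
    (beta0, beta1), *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta0, beta1
```

On a synthetic sequence with $\text{Regret}_t = 2\sqrt{t}$, the fit recovers $\widehat{\beta}_0 = 0.5 < 1$, correctly flagging sublinear growth.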

3.2 Results: Online Learning

We now present the experimental results on the no-regret behavior of pre-trained LLMs in online learning, in: 1) arbitrarily changing environments; 2) non-stationary environments; and 3) bandit-feedback environments.

Online learning in arbitrarily changing environment.

We first consider the setting with arbitrarily changing environments, instantiated as follows: 1) Randomly generated loss sequences. At every timestep, we generate a random loss vector $\ell_t\sim\operatorname{Unif}([0,10]^d)$ or $\ell_t\sim\mathcal{N}(5\boldsymbol{1}_d,I)$ clipped to $[0,10]$ to ensure boundedness, so that the loss vectors at different timesteps can be arbitrarily different; 2) Loss sequences with a predictable trend. Although many real-world environments may change, they often change following certain patterns. Therefore, we consider two representative trends, a linear trend and a periodic (sinusoidal) trend. For the linear trend, we sample $a,b\sim\operatorname{Unif}([0,10]^d)$ and let $\ell_t=(b-a)\frac{t}{T}+a$ for each $t\in[T]$. For the periodic trend, we sample $a,b\sim\operatorname{Unif}([0,10]^d)$ and let $\ell_t=5(1+\sin(at+b))$ for each $t\in[T]$. In the experiments, we choose $d=2$. The average regret (over multiple randomly generated instances) is presented in Figure 1, where we compare GPT-4 with well-known no-regret algorithms: FTRL with entropy regularization and FTPL with Gaussian perturbations (both with tuned parameters). These pre-trained LLMs can indeed achieve no-regret, and often have smaller regret than these baselines.
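The loss generators above, together with an entropy-regularized FTRL baseline (i.e., exponential weights), can be sketched as follows; the learning rate here is illustrative, not the tuned value used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 25, 2

def uniform_losses():                 # randomly generated losses
    return rng.uniform(0, 10, size=(T, d))

def gaussian_losses():                # Gaussian losses, clipped to [0, 10]
    return np.clip(rng.normal(5.0, 1.0, size=(T, d)), 0, 10)

def linear_trend_losses():            # l_t = (b - a) * t / T + a
    a, b = rng.uniform(0, 10, d), rng.uniform(0, 10, d)
    t = np.arange(1, T + 1)[:, None]
    return (b - a) * t / T + a

def sinusoid_losses():                # l_t = 5 * (1 + sin(a t + b))
    a, b = rng.uniform(0, 10, d), rng.uniform(0, 10, d)
    t = np.arange(1, T + 1)[:, None]
    return 5.0 * (1.0 + np.sin(a * t + b))

def ftrl_entropy(losses, eta=0.1):
    """FTRL with entropy regularizer = exponential weights / Hedge."""
    cum, policies = np.zeros(losses.shape[1]), []
    for loss in losses:
        w = np.exp(-eta * (cum - cum.min()))   # shift for numerical stability
        policies.append(w / w.sum())
        cum += loss
    return np.array(policies)

losses = uniform_losses()
policies = ftrl_entropy(losses)
reg = np.sum(losses * policies) - losses.sum(axis=0).min()
```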

Online learning (in non-stationary environment).

We then experiment with the setting where the losses still change over time, but their total variation across time is bounded, more concretely, sublinear in $T$. Correspondingly, we adopt the stronger metric of dynamic regret here to measure performance. Note that without constraining the variation of the loss vectors, dynamic regret can be linear in $T$ in the worst case. Hence, we generate the loss vectors in two different ways: 1) Gradual variation. We first sample $\ell_1\sim\operatorname{Unif}([0,10]^d)$. Then for each $t\geq 1$, we uniformly and randomly generate $\ell_{t+1}$ under the constraint $\|\ell_{t+1}-\ell_t\|_\infty\leq\frac{1}{\sqrt{t}}$, so that the variation over time is guaranteed to satisfy $\sum_{t=1}^{T-1}\|\ell_{t+1}-\ell_t\|_\infty=o(T)$; 2) Abrupt variation. We randomly generate $\ell_1\sim\operatorname{Unif}([0,10]^d)$ and $m$ time indices $\{t_i\}_{i\in[m]}$ from $\{1,2,\cdots,T\}$. At each time step $t_i$ for $i\in[m]$, the sign of the loss vector $\ell_{t_i}$ is flipped, i.e., we let $\ell_{t_i}\leftarrow 10\boldsymbol{1}_d-\ell_{t_i}$. For the specific choice of $T=25$ in our experiments, we choose $m=3$. For both cases, the average dynamic regret results are presented in Table 1.
It can be seen that GPT-4 achieves sublinear dynamic regret and outperforms FTRL/FTPL with Restart , a standard variant of FTRL/FTPL for non-stationary online learning (see e.g., Besbes et al. ( 2014 ) ). We refer to Section A.3 for a more detailed introduction of FTRL/FTPL with Restart.
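The two non-stationary generators can be sketched as follows. Note that the text leaves unspecified how losses evolve between flips in the abrupt case, so the sketch adopts one plausible reading in which each flip persists until the next one:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, m = 25, 2, 3

def gradual_variation():
    """Random walk with shrinking steps: ||l_{t+1} - l_t||_inf <= 1/sqrt(t),
    so the total variation is O(sqrt(T)) = o(T)."""
    losses = [rng.uniform(0, 10, d)]
    for t in range(1, T):
        step = rng.uniform(-1, 1, d) / np.sqrt(t)
        losses.append(np.clip(losses[-1] + step, 0, 10))
    return np.array(losses)

def abrupt_variation():
    """Constant losses, sign-flipped (l <- 10 - l) at m random time steps.
    One plausible reading (our assumption): each flip persists onward."""
    losses = np.tile(rng.uniform(0, 10, d), (T, 1))
    flips = rng.choice(T, size=m, replace=False)
    for t in sorted(flips):
        losses[t:] = 10.0 - losses[t:]
    return losses
```

Clipping to $[0,10]$ only shrinks a step toward the current point, so the gradual sequence still satisfies the $1/\sqrt{t}$ variation constraint.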

Extension to bandit-feedback settings.

Although pre-trained LLMs have achieved good performance in online learning with full-information feedback, it is unclear whether they can still maintain no-regret with only bandit feedback. For such problems, we modify the prompt and the interaction protocol slightly: we still ask the LLM agent to provide a policy $\pi_t$ at time step $t$, and then sample an action $a_t\sim\pi_t(\cdot)$. In the bandit setting, the LLM agent can only access $(a_t,\ell_t(a_t))$. Instead of directly feeding this to the agent, we feed an estimate of the loss vector $\widehat{\ell}_t\in\mathbb{R}^d$, where $\widehat{\ell}_t(a)\leftarrow\frac{\ell_t(a)}{\pi_t(a)}\mathbbm{1}(a_t=a)$ for all $a\in[d]$. Note that such re-weighting of the loss by the inverse of the sampling probability is standard in online learning when adapting full-information-feedback no-regret algorithms to the bandit-feedback setting (Auer et al., 2002). Later, we will also show the benefits of such operations (c.f. Section 4). We compare the performance of pre-trained LLMs with that of the bandit counterparts of FTRL, e.g., EXP3 (Auer et al., 2002) and the bandit version of FTPL (Abernethy et al., 2015), in both Figure 2 and Table 1, where GPT-4 consistently achieves lower regret.
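The importance-weighted estimator can be sketched as follows (the function name is ours); it is unbiased, since $\mathbb{E}[\widehat{\ell}_t(a)]=\pi_t(a)\cdot\ell_t(a)/\pi_t(a)=\ell_t(a)$:

```python
import numpy as np

rng = np.random.default_rng(2)

def bandit_feedback_step(policy, loss_vec):
    """Sample an action from the agent's policy, observe only its loss, and
    build the standard unbiased importance-weighted estimate
    hat_l(a) = l(a) / pi(a) * 1{a_t = a}   (Auer et al., 2002)."""
    d = len(policy)
    a = rng.choice(d, p=policy)
    est = np.zeros(d)
    est[a] = loss_vec[a] / policy[a]
    return a, est
```

Averaging the estimates over many independent rounds recovers the true loss vector, which is what makes full-information algorithms such as FTRL applicable on top of it (yielding EXP3).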

Figure 1: Regret of GPT-3.5 Turbo/GPT-4 for online learning with full-information feedback in 4 different settings. They perform comparably to, and sometimes even better than, well-known no-regret learning algorithms, FTRL and FTPL.
Figure 2: Regret of GPT-3.5 Turbo/GPT-4 for online learning with bandit feedback in 4 different settings. They perform comparably to, and sometimes even better than, well-known no-regret learning algorithms, the variants of FTRL and FTPL with bandit feedback.
Figure 3 : Demonstration of the prompts and interaction protocol for multi-player repeated games. A human moderator does not provide the game’s payoff matrices to the LLMs. Instead, at each round, the human moderator provides each player’s own payoff vector history.

3.3 Results: Multi-Player Repeated Games

We now consider the setting when multiple LLMs make online strategic decisions in a shared environment repeatedly. Specifically, at each round, the loss vectors each agent receives are determined by both her payoff matrix and the strategies of all other agents. Note that the payoff matrix is not directly revealed to the LLM agent, but she has to make decisions in a completely online fashion based on the payoff vector marginalized by the opponents’ strategies (See Figure 3 for a prompt example). This is a typical scenario in learning in (repeated) games (Fudenberg and Levine, 1998 ) .

Representative games.

We first test the repeated play of pre-trained LLMs on 6 representative general-sum games (win-win, prisoner's dilemma, unfair, cyclic, biased, and second best) studied in Robinson and Goforth (2005) (see a detailed introduction of these games in Section A.5). For each type of game, we conduct 20 repeated experiments.

Randomly generated games.

To further validate the no-regret behavior of LLMs, we also test on 50 randomly generated three-player general-sum games and 50 randomly generated four-player general-sum games, where each entry of the payoff matrix is sampled from $\operatorname{Unif}([0,10])$. These are larger and more challenging settings than the structured, representative ones above.

We summarize the experimental results in Figure 4, which are similar to those in the online learning setting: for all types of games, GPT-4 achieves sublinear regret, comparable with that obtained by FTRL for most games. We provide six instances of three-player general-sum games and six instances of four-player general-sum games in Figure 5 and Figure 6, respectively. Occasionally, GPT-4 even attains a negative regret value.
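To illustrate the repeated-game setup with marginalized payoff vectors, here is a minimal sketch with two exponential-weights (FTRL) learners in self-play on a random general-sum game; the game size and learning rate are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(3)
T, d, eta = 25, 3, 0.2

# Random two-player general-sum game, stated in losses in [0, 10].
loss_A = rng.uniform(0, 10, size=(d, d))   # player A's loss for (a_A, a_B)
loss_B = rng.uniform(0, 10, size=(d, d))   # player B's loss for (a_A, a_B)

def hedge(cum_loss):
    """FTRL with entropy regularization = exponential weights."""
    w = np.exp(-eta * (cum_loss - cum_loss.min()))
    return w / w.sum()

cum_A, cum_B = np.zeros(d), np.zeros(d)
hist_A, pols_A = [], []
for t in range(T):
    pi_A, pi_B = hedge(cum_A), hedge(cum_B)
    # Each player only observes her own marginalized loss vector,
    # never the payoff matrix itself (as in the protocol of Figure 3).
    lA, lB = loss_A @ pi_B, loss_B.T @ pi_A
    hist_A.append(lA); pols_A.append(pi_A)
    cum_A += lA; cum_B += lB

realized_A = sum(float(l @ p) for l, p in zip(hist_A, pols_A))
regret_A = realized_A - np.sum(hist_A, axis=0).min()
```

Player A's regret stays within the classical exponential-weights guarantee $\log d/\eta + \eta T B^2/8$ (with $B=10$ the loss range), regardless of how the opponent's play evolves.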

Figure 4: Regret of GPT-3.5 Turbo/GPT-4 for repeated games of 3 different game sizes, where the sublinear regret is validated by both of our statistical frameworks. Due to the symmetry of agents in randomly generated games, we report the regret of one agent for ease of presentation.
Figure 5: Regret of GPT-4 and the FTRL algorithm in 6 randomly generated three-player general-sum games. GPT-4 has comparable (or even better) no-regret properties compared with the FTRL algorithm.
Figure 6: Regret of GPT-4 and the FTRL algorithm in 6 randomly generated four-player general-sum games. GPT-4 has comparable (or even better) no-regret properties compared with the FTRL algorithm, according to the frameworks in Section 3.1 and the graphic trends.
Figure 7: (left) Regret of both GPT-4 and GPT-4 Turbo under the canonical counterexample for FTL (Hazan, 2016, Chapter 5). (mid, right) Failure of GPT-4 and GPT-4 Turbo on two scenarios designed to induce regrettable behaviors, while Transformers trained with our regret-loss maintain no-regret behaviors.
| Dynamic regret | | GPT-4 | GPT-3.5 Turbo | FTRL | FTPL |
|---|---|---|---|---|---|
| Full information | Gradual variation | $12.61\pm 7.01$; $(0.0, 0.58, 0.0)$ | $19.09\pm 11.33$; $(0.0, 0.83, 0.0)$ | $36.58\pm 24.51$ | $35.19\pm 22.51$ |
| Full information | Abrupt variation | $30.0\pm 19.91$; $(0.01, 0.87, 0.0)$ | $33.65\pm 22.51$; $(0.08, 0.96, 0.0)$ | $36.52\pm 27.68$ | $36.24\pm 28.22$ |
| Bandit | Gradual variation | $21.39\pm 10.86$; $(0.0, 0.78, 0.0)$ | $28.42\pm 21.6$; $(0.0, 0.83, 0.0)$ | $37.64\pm 21.97$ | $36.37\pm 20.7$ |
| Bandit | Abrupt variation | $35.94\pm 28.93$; $(0.42, 0.95, 0.0)$ | $30.76\pm 25.48$; $(0.92, 1.01, 0.0)$ | $36.52\pm 27.68$ | $38.82\pm 26.17$ |

Each GPT cell reports the dynamic regret followed by the triple $(p_{trend},\widehat{\beta}_0,p_{reg})$.
Table 1: Dynamic regret of GPT-3.5 Turbo/GPT-4 in non-stationary environments with either full-information or bandit feedback. Every experiment is conducted with 25 rounds. The no-regret behaviors of GPT-3.5 Turbo/GPT-4 are validated by both of our frameworks (low $p$-values and $\widehat{\beta}_0<1$). The only exception is GPT-3.5 Turbo on loss sequences with abrupt variation under bandit feedback, indicating that GPT-3.5 Turbo may not be capable of dealing with an abruptly changing environment under limited feedback, although the average regret it eventually achieves is still lower than that of the other baselines.

3.4 Pre-Trained LLM Agents May Still Have Regret

It seems tempting to conclude that pre-trained LLMs can achieve no-regret in both online learning and playing repeated games. However, is this capability universal ? We show that the no-regret property might break for LLM agents if the loss vectors are generated in a more adversarial way.

Canonical counterexample for follow-the-leader.

To begin with, we consider a well-known example in which the follow-the-leader (FTL) algorithm (Shalev-Shwartz, 2012) suffers linear regret (Hazan, 2016, Chapter 5): $\ell_1(1)=5$, $\ell_1(2)=0$, and $\ell_t(2-t\%2)=10$, $\ell_t(1+t\%2)=0$ for $t\geq 2$, where $\%$ is the modulo operation. Interestingly, the GPT-4 agent can easily identify the pattern of the loss sequence, namely that the optimal action alternates, thus accurately predicting the loss it will receive and achieving near-zero regret in Figure 7. In other words, the GPT-4 agent does not fail in the same way as FTL, whose failure is known to be due to the lack of randomness in its predictions.
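This counterexample, and the linear regret of FTL on it, can be reproduced in a few lines (the tie-breaking and first-round action are arbitrary choices of ours):

```python
import numpy as np

T, d = 25, 2

# The alternating loss sequence from Hazan (2016), Chapter 5.
losses = np.zeros((T, d))
losses[0] = [5.0, 0.0]
for t in range(2, T + 1):                    # 1-indexed round t
    losses[t - 1, (2 - t % 2) - 1] = 10.0    # l_t(2 - t%2) = 10, other = 0

# Follow-the-leader: deterministically play the argmin of cumulative loss.
cum, ftl_loss = np.zeros(d), 0.0
for t in range(T):
    a = int(np.argmin(cum)) if t > 0 else 0  # arbitrary first-round choice
    ftl_loss += losses[t, a]
    cum += losses[t]

best_fixed = losses.sum(axis=0).min()
ftl_regret = ftl_loss - best_fixed
```

On this sequence the leader is always wrong: FTL incurs the full loss of 10 in every round after the first, so its regret grows linearly (here, 125 over 25 rounds against a best fixed-action loss of 120).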

Less predictable loss sequence.

Inspired by the counterexample above, we design a new loss sequence that is similar but less predictable. Specifically, we construct the following (simple) loss sequence with 2 actions such that $\ell_t(1+t\%2)=\min(25/t,10)$ and $\ell_t(2-t\%2)\sim\operatorname{Unif}([9,10])$ for $t\in[25]$.

Adaptive loss sequence.

We also develop a simpler but more adaptive loss sequence that exploits the full power of the adversary in the online learning setup. After the GPT-4 agent provides $\pi_t$, we choose $\ell_t$ such that $\ell_t(\operatorname{arg\,max}_i\pi_{ti})=10$ and $\ell_t(3-\operatorname{arg\,max}_i\pi_{ti})=0$. We report the average regret over 20 repeated experiments for the latter two settings using GPT-4 and the more advanced GPT-4 Turbo in Figure 7, where we cannot reject the hypothesis that GPT-4 (Turbo) has linear regret by either our trend-checking or regression-based framework.
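A sketch of this adaptive adversary for $d=2$ actions, here run against a deterministic follow-the-leader agent for illustration: because the adversary sees the policy before choosing the loss, any agent that concentrates its policy is punished every round, and a deterministic agent provably incurs the maximal loss of 10 per round.

```python
import numpy as np

T, d = 25, 2

def adaptive_adversary(policy):
    """Full-power adversary: after seeing the agent's policy, put the
    maximal loss on the agent's most probable action."""
    loss = np.zeros(len(policy))
    loss[int(np.argmax(policy))] = 10.0
    return loss

def ftl_agent(cum):
    """Deterministic follow-the-leader: all mass on the argmin of the
    cumulative loss (ties broken toward the first action)."""
    pi = np.zeros(len(cum))
    pi[int(np.argmin(cum))] = 1.0
    return pi

cum, realized, loss_hist = np.zeros(d), 0.0, []
for t in range(T):
    pi = ftl_agent(cum)
    loss = adaptive_adversary(pi)
    realized += float(loss @ pi)
    loss_hist.append(loss)
    cum += loss

regret = realized - np.sum(loss_hist, axis=0).min()
```

The agent pays $10T = 250$ in total while the best fixed action pays at most half of that, so the regret is linear in $T$; sufficiently randomized (near-uniform) play is the only way to keep the per-round loss near 5.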

These observations have thus motivated us to design new approaches to better promote the no-regret property of LLM agents, with additional training, as to be detailed in Section 5 . Before delving into the design of such a stronger LLM agent, we first provide some theoretical insights into why pre-trained LLMs have already exhibited good no-regret behaviors oftentimes.

4 Why Do Pre-Trained LLMs (Not) Have Regret? A Hypothetical Model and Some Theoretical Insights

We now provide some plausible explanations about the no-regret behavior of pre-trained LLMs, as observed in Sections 3.2 and 3.3 . Note that our explanations have to be hypothetical by nature, since to the best of our knowledge, the details of pre-training these popular LLMs (e.g., GPT-3.5 Turbo and GPT-4), regarding data distribution, training algorithm, etc., have not been revealed. We instead make the explanations based on some common assumptions and arguments in the literature for modeling human behaviors, and the recent literature on understanding LLMs/Transformers.

4.1 Pre-Trained LLMs Have Similar Regret as Humans (Who Generate Data)

We first provide a direct observation based on some existing speculation about the capability of Transformer-based LLMs. Recently, a growing literature has evidenced that the intelligence level of LLM agents is determined by, and in fact mimics, that of the human beings who generate the data for pre-training the models (Park et al., 2022; Argyle et al., 2023; Horton, 2023). The key rationale is that LLMs (with Transformer parameterization) can approximate the pre-training data distribution very well (Xie et al., 2022; Zhang et al., 2023b; Lee et al., 2023). In such a context, one can expect that LLM agents achieve similar regret as the human decision-makers who generate the pre-training data, as we formally state below.

Observation 1 .

An LLM agent is said to be pre-trained with an $\epsilon$-decision error if, for any $t$ and loss sequence $(\ell_i)_{i\in[t]}$, the following condition holds:

$$\sup_{\pi\in\Pi}\;\big|P_{\text{data}}\big(\pi\,|\,(\ell_i)_{i\in[t]}\big)-P_{\text{LLM}}\big(\pi\,|\,(\ell_i)_{i\in[t]}\big)\big|\leq\epsilon,$$

where $P_{\text{data}}$ and $P_{\text{LLM}}$ are the pre-training data distribution and the decision policy distribution of the pre-trained LLM, respectively. Then, the regret of an LLM agent with $\epsilon$-decision error is bounded as:

$$\text{(D-)Regret}_{\text{LLM}}\big((\ell_t)_{t\in[T]}\big)\in\Big[\text{(D-)Regret}_{\text{data}}\big((\ell_t)_{t\in[T]}\big)\pm\epsilon\sum_{t\in[T]}\|\ell_t\|\sup_{\pi\in\Pi}\|\pi\|\Big],$$

where $[a\pm b]:=[a-b,a+b]$.

Observation 1 shows that the pre-trained LLM agent's regret can be controlled by that of the pre-training dataset and the decision error $\epsilon$. A small $\epsilon$ can be achieved if the LLM is constructed from a rich function class, e.g., the Transformer architecture (Zhang et al., 2023b; Lin et al., 2024). We defer a proof of the result to Appendix C.

Note that the above observation does not provide the full picture: $\text{(D-)Regret}_{\text{data}}\big((\ell_t)_{t\in[T]}\big)$, the (dynamic) regret of using $P_{\text{data}}\big(\pi\,|\,(\ell_i)_{i\in[t]}\big)$ for online learning, is not necessarily sublinear in $T$. We discuss next under which (natural) models and pre-training processes the no-regret behavior of pre-trained LLMs may emerge.

4.2 A Human Decision-Making Model: Quantal Response

A seminal model for human decision-making behavior is the quantal response model, which assumes that humans are often imperfect decision-makers, and their bounded rationality can be modeled through unseen latent variables that influence the decision-making process (McFadden, 1976 ; McKelvey and Palfrey, 1995 ) . Formally, the quantal response is defined as follows:

Definition 1 (Quantal response) .

Given a loss vector $\ell\in\mathbb{R}^d$, a noise distribution $\epsilon\sim P_{\text{noise}}$, and $\eta>0$, the quantal response is defined as

$$P_{\text{quantal}}^{\eta}\big(a\,|\,\ell\big)=\mathbb{P}\Big(a\in\operatorname*{arg\,min}_{a'\in\mathcal{A}}\;z(a')\Big),\qquad\text{where }z=\ell+\eta\epsilon.$$

In essence, this implies that humans are rational with respect to (w.r.t.) the latent variable $z$, a perturbed version of $\ell$, rather than $\ell$ per se. This addition of noise to the actual loss vector characterizes the bounded rationality of humans in decision-making.
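While Definition 1 leaves $P_{\text{noise}}$ general, one classical choice makes the response distribution explicit: with negated standard Gumbel noise, the quantal response coincides with a softmax over $-\ell/\eta$ (the Gumbel-max trick). A Monte Carlo sketch (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)

def quantal_response_sample(ell, eta, n=200_000):
    """Empirical quantal response: perturb the loss as z = ell + eta * eps
    and play argmin_a z(a). Here eps is a *negated* standard Gumbel, for
    which the response distribution is exactly softmax(-ell / eta)."""
    d = len(ell)
    eps = -rng.gumbel(size=(n, d))
    choices = np.argmin(ell + eta * eps, axis=1)
    return np.bincount(choices, minlength=d) / n

ell, eta = np.array([1.0, 2.0, 3.0]), 1.0
empirical = quantal_response_sample(ell, eta)
exact = np.exp(-ell / eta) / np.exp(-ell / eta).sum()
```

The equivalence follows since $\operatorname{arg\,min}_a(\ell_a-\eta G_a)=\operatorname{arg\,max}_a(-\ell_a/\eta+G_a)$ for i.i.d. standard Gumbel $G_a$, and the latter is distributed as $\operatorname{softmax}(-\ell/\eta)$.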

The traditional quantal response formulation primarily focuses on scenarios with a single loss vector. In online decision-making, given the history information, the human at each time $t$ is faced with multiple loss vectors. Hence, we adopt the following generalization to model decision-making behavior in this setting.

Definition 2 (Quantal response against multiple losses) .

Given a set of loss vectors $(\ell_i)_{i\in[t]}$, a noise distribution $P_{\text{noise}}$, and $\eta_t>0$, the generalized quantal response against $(\ell_i)_{i\in[t]}$ is defined as

$$P_{\text{quantal}}^{\eta_t}\big(a\,|\,(\ell_i)_{i\in[t]}\big):=P_{\text{quantal}}^{\eta_t}\bigg(a\,\bigg|\,\sum_{i=1}^{t}\ell_i\bigg).$$

For $t=0$, we define $P_{\text{quantal}}^{\eta_0}\big(a\,|\,(\ell_i)_{i\in[0]}\big):=P_{\text{quantal}}^{\eta_0}\big(a\,|\,\emptyset\big)=\frac{1}{d}$.

In simpler terms, the generalized quantal response is the standard quantal response against a (scaled) summation of the losses. Such a dynamic version of quantal response also has implications in behavioral economics, and has recently been used to model human behaviors in sequential decision-making (Ding et al., 2022) (in stochastic and stationary environments). Indeed, responding to multiple loss vectors in this way is believed to be natural, and has been widely adopted in well-known no-regret learning algorithms: smooth/stochastic fictitious play (Fudenberg and Kreps, 1993) and follow-the-perturbed-leader (Kalai and Vempala, 2005), whose formal definitions can be found in Section A.3. Finally, note that the response model in Definition 2 does not necessarily involve a sequential decision-making process, i.e., the set of losses need not come from the history of an online learning process.

With this (natural) human response model in hand, we hypothesize that powerful pre-trained LLMs behave similarly. To illustrate how Definition 2 can potentially capture the behavior of LLMs and not only humans, we provide examples of LLMs explaining how they output their policies in Section B.4. It can be observed that the LLM agent tends to use the history of the reward vectors by looking at the "sum/average of the accumulated rewards", and tends to introduce "randomization" in decision-making in order to be "less predictable and exploitable". These are known to be key to achieving no-regret behaviors in online learning and games (Cesa-Bianchi and Lugosi, 2006). Hence, we analyze a case where pre-training under certain canonical data assumptions provably leads to quantal response behaviors and further yields no-regret guarantees, to gain insight into the empirical observations in Sections 3.2 and 3.3.

4.3 Case Study: Pre-Training under Canonical Data Distribution

Pre-training of LLMs often involves the approach of next-token prediction. When applying LLMs to sequential decision-making, the model receives the context of the decision-making sequence and then generates the action. This process can be conceptualized as predicting the optimal action in the form of next-token prediction; for instance, Yao et al. (2023b); Shinn et al. (2023); Liu et al. (2023a, c) demonstrated how decision-making may be framed in this way. Concretely, the LLM receives the context $(x_1, x_2, \cdots, x_{N_1})$ and outputs $(x_{N_1+1},\cdots,x_{N_2})$ for some $N_1, N_2 \in \mathbb{N}^+$ with $N_1 < N_2$, where each $x_i \in \mathcal{V}$ represents one natural-language token for $i \in [N_2]$, and $\mathcal{V}$ is the finite token set. Afterwards, the human user extracts the corresponding decision action $a$ from the text $(x_{N_1+1},\cdots,x_{N_2})$.

Meanwhile, in these cases, large models such as Transformers are often (pre-)trained for (sequential) decision-making under several fixed/stationary environments (Laskin et al., 2023; Lin et al., 2024; Lee et al., 2023; Reed et al., 2022), which may limit their ability to handle the arbitrary/non-stationary/adversarial loss sequences in our online learning setup. Thus, it is natural to ask: Is it possible for no-regret behaviors to emerge as a consequence of this (optimal) action prediction, under only a fixed pre-training distribution of stationary environments?

Here we analyze a standard pre-training objective on a token sequence distribution $x_{1:N_{t+1}} \sim P^{\text{text}}_t$ for given $t \in [T]$, namely, expected log-likelihood maximization for next-token prediction:

\[
\max_{\theta\in\Theta}\quad \mathbb{E}_{x_{1:N_{t+1}}\sim P^{\text{text}}_{t}} \sum_{j=1}^{N_{t+1}} \log \text{LLM}_{\theta}\left(x_j \,\big|\, x_{1:j-1}\right), \tag{1}
\]

where we define $\text{LLM}_\theta(x_1 \,|\, x_{1:0}) = \text{LLM}_\theta(x_1)$, and $\Theta$ denotes the parameter space of the LLM.

For the pre-training data distribution, intuitively, we model it as follows: there exists a latent variable $z$ representing the loss of the underlying static decision-making problem. Due to imperfect data collection, the pre-training dataset only contains partial observations $x_{1:N_t}$ (a natural-language representation of $\ell_{1:t}$) of $z$. The presence of partial observations could be attributed to the fact that $z$ is only privately known by the data-generator (human), representing the intention of the human being/data-generator; hence, the LLM is pre-trained with only partial and noisy information about $z$. Meanwhile, we assume that the optimal action $x_{N_t+1:N_{t+1}}$ (a natural-language representation of $a$) w.r.t. the underlying loss vector $z$ is available in the pre-training dataset as the label, which could come from user surveys, personal blogs, or explicit data annotation. We formalize such a scenario in the assumption below:

Assumption 1 .

Given $T \in \mathbb{N}^+$, $t \in [T]$, and $N_{t+1} \in \mathbb{N}^+$, there exist a latent variable $z$, losses $\ell_{1:t}$, indices $N_1, \cdots, N_t \in [N_{t+1}]$, and $N_0 = 0$, such that
\[
\mathbb{P}(z, \ell_{1:t}, x_{1:N_{t+1}}) = \mathbb{P}(z, \ell_{1:t})\, \mathbb{P}(x_{1:N_t} \,|\, \ell_{1:t})\, \mathbb{P}(x_{N_t+1:N_{t+1}} \,|\, z),
\]
and

\[
P^{\text{text}}_t(x_{1:N_{t+1}}) = \mathbb{P}(x_{1:N_{t+1}}) = \int_z \int_{\ell_{1:t}} \mathbb{P}(z, \ell_{1:t}, x_{1:N_{t+1}})\, d\ell_{1:t}\, dz,
\]

where we assume the tokens $x_{1:N_t}$ encode the context information, i.e., the information for $\ell_{1:t}$, and the user will decode the action $a$ from $x_{N_t+1:N_{t+1}}$. More formally, we denote the mappings that decode such semantic information into numeric values by $f$ and $g$, such that $f(x_{N_{i-1}+1:N_i}) = \ell_i \in \mathbb{R}^d$ for each $i \in [t]$, and $g(x_{N_t+1:N_{t+1}}) = a \in \mathcal{A}$.

To further understand our assumption, we provide the following example and lemma, which recover the prominent human behavior model introduced in Section 4.2, the quantal response, under some canonical data distributions.

Example 1 (An example instantiating Assumption 1) .

We consider a common decision-making task that may generate the training data: recommender systems. An instance of the text data could be: "On September 29, 2023, user X clicked movie A three times, movie B eight times, and movie C five times". This sentence corresponds to $x_{N_{i-1}+1:N_i}$ for some $i \in [t]$ and serves as a natural-language depiction of the numerical $\ell_i$. The corresponding label $x_{N_t+1:N_{t+1}}$ can be obtained from some user survey: "User X's favorite movie is movie B". Meanwhile, $z$ represents user X's latent, genuine preference for each movie, information that is private to the user and cannot be observed or collected in the pre-training dataset. In this example, Assumption 1 suggests that $x_{1:N_t}$, which records the frequency of interactions with each movie, serves as an imperfect estimate of the user's latent, genuine preference for the movies, while the text $x_{N_t+1:N_{t+1}}$ depicts the user's favorite movie based only on her latent $z$.
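Abstracting away the natural-language encoding (i.e., treating the decoding maps $f$ and $g$ as identities), the data-generating process of Assumption 1 under the Gaussian specification used later in Lemma 1 can be sketched as follows. All names and default values here are illustrative assumptions, not from the paper.

```python
import numpy as np

def sample_episode(d=3, t=5, sigma=1.0, seed=0):
    """One pre-training example under Assumption 1 (Gaussian instance):
    a latent loss z is drawn from the prior, the context consists of t
    noisy observations of z (standing in for the decoded texts f(x)),
    and the label is the optimal action under z itself (g(x))."""
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, sigma, size=d)                  # latent true loss
    context = z + rng.normal(0.0, sigma, size=(t, d))   # noisy l_1, ..., l_t
    label = int(np.argmin(z))                           # optimal-action label
    return z, context, label
```

Note that the label depends only on the latent $z$, while the context only carries noisy information about $z$, mirroring the conditional-independence structure $\mathbb{P}(x_{1:N_t} \,|\, \ell_{1:t})\,\mathbb{P}(x_{N_t+1:N_{t+1}} \,|\, z)$ in the assumption.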

Lemma 1 .

(Alignment of Assumption 1 with quantal response) . Fix $t \in [T]$ and $\sigma > 0$. If we model the noise of data collection as i.i.d. Gaussian in the numeric value space, i.e.,

\[
\mathbb{P}\left(\{f(x_{N_{i-1}+1:N_i})\}_{i\in[t]} \,\big|\, z\right) = \prod_{i=1}^t \mathbb{P}\left(f(x_{N_{i-1}+1:N_i}) \,\big|\, z\right) \propto \prod_{i=1}^t \exp\left(-\frac{\|f(x_{N_{i-1}+1:N_i}) - z\|_2^2}{2\sigma^2}\right),
\]

the prior distribution of the latent variable $z$ is also Gaussian, i.e., $z \sim \mathcal{N}(\boldsymbol{0}_d, \sigma^2 I)$, and the text labels satisfy $\mathbb{P}(g(x_{N_t+1:N_{t+1}}) \,|\, z) = \mathbbm{1}\left(g(x_{N_t+1:N_{t+1}}) \in \operatorname*{arg\,min}_{a\in\mathcal{A}} z_a\right)$, then we have

\[
\mathbb{P}\left(g(x_{N_t+1:N_{t+1}}) \,\big|\, x_{1:N_t}\right) = P_{\text{quantal}}^{\sigma\sqrt{t+1}}\left(g(x_{N_t+1:N_{t+1}}) \,\big|\, \{f(x_{N_{i-1}+1:N_i})\}_{i\in[t]}\right),
\]

with $P_{\text{noise}} = \mathcal{N}(\boldsymbol{0}_d, I)$ in Definition 2, i.e., the action $a = g(x_{N_t+1:N_{t+1}})$ extracted from the text $x_{N_t+1:N_{t+1}}$ is a quantal response w.r.t. the loss vectors $\left(f(x_{N_{i-1}+1:N_i})\right)_{i\in[t]}$.
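The Gaussian posterior computation behind Lemma 1 can be checked numerically: with prior $z \sim \mathcal{N}(\boldsymbol{0}_d, \sigma^2 I)$ and $t$ observations with noise variance $\sigma^2$, the posterior of $z$ is $\mathcal{N}\big(\sum_{i} \ell_i/(t+1),\ \sigma^2/(t+1)\, I\big)$, so the probability that each action is the posterior argmin matches the quantal response with $\eta = \sigma\sqrt{t+1}$ against the summed losses. The following Monte-Carlo sketch is our own illustration, not the paper's code.

```python
import numpy as np

def posterior_argmin_probs(losses, sigma, n=200_000, seed=0):
    """P(a = argmin z | l_1..l_t) under prior z ~ N(0, sigma^2 I) and
    observations l_i | z ~ N(z, sigma^2 I): sample from the Gaussian
    posterior and count how often each coordinate is the minimum."""
    rng = np.random.default_rng(seed)
    t, d = losses.shape
    mean = losses.sum(axis=0) / (t + 1)      # posterior mean
    std = sigma / np.sqrt(t + 1)             # posterior std (isotropic)
    z = mean + std * rng.standard_normal((n, d))
    return np.bincount(np.argmin(z, axis=1), minlength=d) / n

def quantal_response_probs(losses, eta, n=200_000, seed=1):
    """Quantal response with standard Gaussian noise against summed losses."""
    rng = np.random.default_rng(seed)
    t, d = losses.shape
    w = rng.standard_normal((n, d))
    picks = np.argmin(losses.sum(axis=0) + eta * w, axis=1)
    return np.bincount(picks, minlength=d) / n
```

With $\eta = \sigma\sqrt{t+1}$, the two estimates agree up to Monte-Carlo error, which is exactly the statement of Lemma 1.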

We defer the proof of the lemma to Section C.2 . Now based on this lemma, we provide the no-regret guarantees of the corresponding pre-trained LLM.

Theorem 1 .

(Emergence of no-regret behavior) . Under the assumptions of Lemma 1, suppose the function class of $\text{LLM}_\theta$ is expressive enough such that for all $t \in [T]$,
\[
\max_{\theta\in\Theta} \mathbb{E}_{x_{1:N_{t+1}}\sim P^{\text{text}}_t} \sum_{j=1}^{N_{t+1}} \log \text{LLM}_\theta(x_j \,|\, x_{1:j-1}) = \max_{\{q_j \in \{\mathcal{V}^{j-1}\to\Delta(\mathcal{V})\}\}_{j\in[N_{t+1}]}} \mathbb{E}_{x_{1:N_{t+1}}\sim P^{\text{text}}_t} \sum_{j=1}^{N_{t+1}} \log q_j(x_j \,|\, x_{1:j-1}),
\]
where we define $q_1(x_1 \,|\, x_{1:0}) := q_1(x_1)$, and let $\theta^\star$ maximize Equation 1. Then, there exist (simple) algorithms using $\text{LLM}_{\theta^\star}$ that achieve no (dynamic) regret for (non-stationary) online learning with full-information/bandit feedback.
To be specific, for (2) and (4) below, define the variation bound $V_T$ such that $\sum_{t=1}^{T-1} \|\ell_{t+1} - \ell_t\|_\infty \le V_T$, $V_T \le T$, and $V_T = \Theta(T^\rho)$ for some $\rho \in (0,1)$. Then, for large enough $T$ and $d$, it holds that:

(1) For online learning with full-information feedback, $\text{Regret}_{\text{LLM}_{\theta^\star}}\left((\ell_t)_{t\in[T]}\right) \le \mathcal{O}\left(\sqrt{T \log d}\right)$;

(2) For non-stationary online learning with full-information feedback,

\[
\text{D-Regret}_{\text{LLM}_{\theta^\star}}\left((\ell_t)_{t\in[T]}\right) \le \mathcal{O}\left((\log d \cdot V_T)^{1/3} T^{2/3}\right);
\]

(3) For online learning with bandit feedback,

\[
\mathbb{E}\left[\text{Regret}_{\text{LLM}_{\theta^\star}}\left((\ell_t)_{t\in[T]}\right)\right] \le \mathcal{O}\left((\log d)^{1/2}\, d\, T^{1/2 + 1/\log T} \log T\right);
\]

(4) For non-stationary online learning with bandit feedback,

\[
\mathbb{E}\left[\text{D-Regret}_{\text{LLM}_{\theta^\star}}\left((\ell_t)_{t\in[T]}\right)\right] \le \mathcal{O}\left((T^2 d^2 V_T)^{1/3} (\log d)^{1/2}\, T^{1/\log T} \log T\right).
\]
Remark 1 (Implication for playing repeated games) .

First, we note that the no-regret guarantee in the online setting is stronger than, and thus implies, the corresponding guarantee in the game setting, since regret by definition handles arbitrary/adversarial environments, while the opponents in a game are not necessarily as adversarial. Second, it is folklore that if all players in a repeated game follow no-regret learning algorithms, then the time-averaged policies of all players during learning constitute an approximate coarse correlated equilibrium of the game (Cesa-Bianchi and Lugosi, 2006). Hence, results (1) and (2) in Theorem 1 imply that a coarse correlated equilibrium will emerge in the long run from the interactions of the LLM agents (under the assumptions of the theorem).

We present proofs of the non-asymptotic bounds for (dynamic) regret in Section C.4. Furthermore, we demonstrate that the prior distribution of $z$ could also be replaced by a general distribution (cf. Theorem 5) to obtain the above results. We also point out that Assumption 1 may be further relaxed to better match the actual LLMs' pre-training data distributions from diverse sources (cf. Section C.5).

It is important to observe that even when pre-training is conducted solely under stationary loss vector generation ($\ell_{1:t}$ are i.i.d. conditioned on $z$), no-regret behavior can still emerge in online learning with potentially adversarial losses. Key to the proof is a connection between the pre-trained LLM and the online learning algorithm of follow-the-perturbed-leader (FTPL).
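To make the FTPL connection concrete, here is a minimal sketch (our own, with illustrative parameter choices) of follow-the-perturbed-leader with Gaussian perturbations: at each round, play the argmin of the cumulative loss plus scaled noise. Sampling this perturbed argmin is precisely the generalized quantal response, and the schedule $\eta_t = \sigma\sqrt{t+1}$ matches Lemma 1.

```python
import numpy as np

def ftpl(losses, sigma=1.0, seed=0):
    """Follow-the-perturbed-leader with Gaussian perturbations on a
    (T, d) loss matrix; returns the realized regret against the best
    fixed action in hindsight."""
    rng = np.random.default_rng(seed)
    T, d = losses.shape
    cum = np.zeros(d)
    incurred = 0.0
    for t in range(T):
        eta = sigma * np.sqrt(t + 1)                  # eta_t schedule
        a = int(np.argmin(cum + eta * rng.standard_normal(d)))
        incurred += losses[t, a]
        cum += losses[t]
    return incurred - losses.sum(axis=0).min()        # realized regret
```

On bounded loss sequences, the realized regret of this scheme grows on the order of $\sqrt{T}$, so the time-averaged regret vanishes.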

Comparison to Lee et al. ( 2023 ); Lin et al. ( 2024 ) .

Intriguingly, similar assumptions and pre-training objectives have also been considered in the very recent work of Lee et al. (2023); Lin et al. (2024) for studying the in-context reinforcement learning properties of Transformers under supervised pre-training. Lee et al. (2023) established its equivalence to posterior sampling (Osband et al., 2013), an important RL algorithm with provable regret guarantees when the environments are stationary, and Lin et al. (2024) generalized the study to the setting of algorithm distillation as in Laskin et al. (2023). However, their results do not directly imply a no-regret guarantee in our online learning setting, since posterior sampling can perform poorly under potentially adversarial or non-stationary environments (Zimmert and Seldin, 2021; Liu et al., 2023b). In contrast, we here establish the equivalence of the pre-trained LLM to the FTPL algorithm (under different pre-training distribution specifications), which can handle arbitrary loss sequences, even though the LLMs are trained only on a fixed distribution of texts (tasks).

Calibrating the degree of bounded rationality of actual LLMs.

To further validate our model and data distribution assumptions, we also propose to calibrate the parameters $\{\eta_t\}_{t\in[T-1]}$ in Definition 2, the degree of bounded rationality, by estimating them using data from interactions with LLMs (following the same protocol as before), with $P_{\text{noise}}$ being a standard normal distribution (note that we do not need to calibrate $\eta_0$ by Definition 2). Specifically, given $n$ episodes of the LLM agent's behavior $\{(\ell_t^{(j)}, \pi_t^{(j)})_{t\in[T]}\}_{j\in[n]}$, motivated by our Lemma 1 and Theorem 1, we calibrate $\{\eta_t\}_{t\in[T-1]}$ by solving the following problem:

\[
\sigma^\star \in \arg\min_{\sigma > 0} \sum_{t\in[T-1]} \sum_{j\in[n]} \left\|\pi_{t+1}^{(j)} - P^{\sigma\sqrt{t+1}}_{\text{quantal}}\left(\cdot \,\big|\, \ell_{1:t}^{(j)}\right)\right\|_1, \qquad \eta_t^\star = \sigma^\star \sqrt{t+1}, \quad \forall t \in [T-1].
\]

We solve this single-variable optimization problem by grid search over $[0, 10]$. We then run the generalized quantal response model with the calibrated $\{\eta_t^\star\}_{t\in[T-1]}$ on another unseen test set, and compare it with the behavior of the actual LLM agents. We use all the interaction data from Section 3.2 and split it in half for training and testing. In Figure 8, we show the averaged regret for the LLM agent and the calibrated generalized quantal response. The calibrated generalized quantal response captures the behavior of the LLM agent very well for all problem instances in Section 3.2, justifying the applicability of our hypothesized model and assumptions.
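The calibration step can be sketched as a simple grid search; the data layout (a list of (losses, policies) array pairs) and the Monte-Carlo budget are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def quantal_probs(cum_loss, eta, n_mc, rng):
    """Monte-Carlo generalized quantal response given summed losses."""
    d = cum_loss.shape[0]
    picks = np.argmin(cum_loss + eta * rng.standard_normal((n_mc, d)), axis=1)
    return np.bincount(picks, minlength=d) / n_mc

def calibrate_sigma(episodes, sigma_grid, n_mc=20_000, seed=0):
    """Grid search for the sigma minimizing the total L1 gap between
    observed policies pi_{t+1} and P_quantal^{sigma*sqrt(t+1)}(. | l_{1:t}).
    `episodes` is a list of (losses, policies) pairs, each of shape (T, d)."""
    best_sigma, best_err = None, np.inf
    for sigma in sigma_grid:
        rng = np.random.default_rng(seed)
        err = 0.0
        for losses, policies in episodes:
            T, d = losses.shape
            for t in range(1, T):                     # history l_1..l_t
                eta = sigma * np.sqrt(t + 1)
                pred = quantal_probs(losses[:t].sum(axis=0), eta, n_mc, rng)
                err += np.abs(policies[t] - pred).sum()
        if err < best_err:
            best_sigma, best_err = sigma, err
    return best_sigma
```

Resetting the random generator per candidate makes the comparison across grid points use common random numbers, which reduces the Monte-Carlo noise in the selection.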

Figure 8 : (left) Comparison of GPT-4 with a calibrated agent on the test set, where the calibrated quantal response captures the behavior of the GPT-4 agent almost perfectly. (mid, right) On the less predictable and adaptive loss sequences, the calibrated agent no longer makes accurate predictions for GPT-4.

We also use the same framework to understand the regrettable behaviors in Section 3.4, using all the data from that section. We first find that this fitting procedure does not yield good predictions for LLMs on those counter-examples. Therefore, we resort to a more expressive model by directly fitting each $\eta_t$ separately for each $t \in [T-1]$:
\[
\eta_t^\star \in \arg\min_{\eta_t > 0} \sum_{j\in[n]} \left\|\pi_{t+1}^{(j)} - P^{\eta_t}_{\text{quantal}}\left(\cdot \,\big|\, \ell_{1:t}^{(j)}\right)\right\|_1.
\]
Even under this more expressive model, LLMs fail to follow the generalized quantal response on the counter-examples with less predictable or adaptive loss sequences, as shown by the gap in Figure 8 between the (dynamic) regret of GPT-4 and that of the calibrated agent.

Finally, we acknowledge that for most existing pre-trained LLMs like GPT-4, the canonical assumptions above, though they may be further relaxed (cf. Remark 4), may not hold in general. More importantly, the supervision labels, i.e., the optimal action given $z$, may sometimes be imperfect or unavailable during dataset collection. Hence, it is entirely possible to observe regrettable behaviors (cf. Section 3.4). Motivated by these caveats, we next propose a new training loss that is unsupervised and can provably promote no-regret behavior.

5 Provably Promoting No-Regret Behavior by an Unsupervised Loss

In light of the observations in Section 3 , we ask the question:

Is there a way to further enhance the no-regret property of LLM agents,
hopefully without (optimal) action labels?

To address this question, we propose to train LLMs with a new unsupervised learning loss that naturally leads to no-regret behaviors. This approach is akin to the process of "instruction tuning" (Wei et al., 2021), which has been shown to enhance LLMs' ability to learn from context, with both theoretical (Ahn et al., 2023; Mahankali et al., 2023; Zhang et al., 2023a) and empirical (Lu et al., 2023) evidence.

5.1 A New Unsupervised Training Loss: Regret-Loss

Intuitively, our new training loss is designed to force the trained LLM to minimize regret under an arbitrary sequence of loss vectors. Specifically, we define the training loss as

\[
\mathcal{L}(\theta) := \max_{\ell_1,\dots,\ell_T} \text{Regret}_{\text{LLM}_\theta}\left((\ell_t)_{t\in[T]}\right), \tag{2}
\]

where $\|\ell_t\|_\infty \le B$ for $t \in [T]$. As discussed in Kirschner et al. (2023), directly minimizing the max regret can be computationally challenging, except for superficially simple problems. Hence, in practice, one may parameterize the LLM and resort to differentiable programming to solve it approximately. However, Equation 2 is not necessarily differentiable with respect to the parameter $\theta$ if it does not satisfy the condition of Danskin's Theorem (Danskin, 1966); and even if it is differentiable (i.e., the maximizer over $(\ell_t)_{t\in[T]}$ is unique), computing derivatives can be challenging, since we need to calculate $\operatorname*{arg\,max}_{(\ell_t)_{t\in[T]}} \text{Regret}_{\text{LLM}_\theta}((\ell_t)_{t\in[T]})$ while the definition of regret itself involves an $\inf$. Therefore, we provide a general framework that approximates Equation 2 by the following surrogate:

\[
\mathcal{L}(\theta, k, N) := \mathbb{E}\left[\frac{\sum_{j\in[N]} h\big(\text{Regret}_{\text{LLM}_\theta}((\ell_t^{(j)})_{t\in[T]})\big)\, f\big(\text{Regret}_{\text{LLM}_\theta}((\ell_t^{(j)})_{t\in[T]}),\, k\big)}{\sum_{j\in[N]} f\big(\text{Regret}_{\text{LLM}_\theta}((\ell_t^{(j)})_{t\in[T]}),\, k\big)}\right], \tag{3}
\]

where $k, N \in \mathbb{N}^+$, $h: \mathbb{R} \to \mathbb{R}^+$ is a continuous function with continuous derivative $h'$, and $f(\cdot, k): \mathbb{R} \to \mathbb{R}^+$ is a continuous function for each $k \in \mathbb{N}^+$, satisfying $\lim_{k\to\infty} \frac{f(R_1,k)}{f(R_2,k)} = \infty \cdot \mathbbm{1}(R_1 > R_2) + \mathbbm{1}(R_1 = R_2)$, where we use the convention $\infty \cdot 0 = 0$. These conditions on $h, f$ will be assumed throughout the paper; examples of such an $f$ include $f(x,k) = x^k$ and $f(x,k) = \exp(kx)$. We sample $N$ trajectories of loss sequences $(\ell_t^{(j)})_{t\in[T], j\in[N]}$ from some continuous probability distribution supported on $[-B, B]^{T\times N}$, and the expectation in Equation 3 is taken with respect to this distribution. Note that we do not make any statistical assumption on $(\ell_t^{(j)})_{t\in[T], j\in[N]}$ (except continuity and support on $[-B,B]^{T\times N}$), in contrast to the assumptions in Section 4 used to justify the no-regret property of pre-trained LLMs.

In Section D.2, we prove that under certain regularity conditions on $f$ and $h$, we have

\[
\lim_{N,k\to\infty} \mathcal{L}(\theta, k, N) = h\left(\max_{\ell_1,\dots,\ell_T} \text{Regret}_{\text{LLM}_\theta}((\ell_t)_{t\in[T]})\right),
\]

as well as the uniform convergence of $\mathcal{L}(\theta, k, N)$:
\[
\lim_{N,k\to\infty} \sup_{\theta\in\Theta} \left| h\left(\max_{\ell_1,\dots,\ell_T} \text{Regret}_{\text{LLM}_\theta}((\ell_t)_{t\in[T]})\right) - \mathcal{L}(\theta, k, N) \right| = 0,
\]
where $\Theta$ is some compact set of LLM parameters. Hence, one can expect that minimizing the loss function in Equation 3 with large enough $k$ and $N$ promotes the trained LLM to have a small regret value. We will hereafter refer to Equation 3 as the regret-loss. Similarly, we can define a dynamic-regret-loss, and the results to be presented next also generalize to this case (cf. Remark 5 in Section D.3).

5.2 Guarantees via Regret-Loss Minimization

We first establish a statistical guarantee under general parameterizations of $\text{LLM}_\theta$ that are Lipschitz with respect to $\theta$, including the Transformer-based models used in GPT-4 and most existing LLMs (see Proposition 2 for an example with a formal statement). This guarantee concerns their generalization ability when trained to minimize the empirical regret-loss, which is defined as follows:

Definition 3 (Empirical loss function) .

We define the empirical loss $\widehat{\mathcal{L}}$ computed with $N_T$ samples as follows:

\[
\widehat{\mathcal{L}}(\theta, k, N, N_T) := \frac{1}{N_T} \sum_{s=1}^{N_T} \left[\frac{\sum_{j\in[N]} h\big(\text{Regret}_{\text{LLM}_\theta}((\ell_{s,t}^{(j)})_{t\in[T]})\big)\, f\big(\text{Regret}_{\text{LLM}_\theta}((\ell_{s,t}^{(j)})_{t\in[T]}),\, k\big)}{\sum_{j\in[N]} f\big(\text{Regret}_{\text{LLM}_\theta}((\ell_{s,t}^{(j)})_{t\in[T]}),\, k\big)}\right], \tag{4}
\]

where $(\ell_{s,t}^{(j)})_{j\in[N], t\in[T]}$ denotes the $s$-th sample of $(\ell_t^{(j)})_{j\in[N], t\in[T]}$ for estimating $\mathcal{L}(\theta, k, N)$.

We denote $\widehat{\theta}_{k,N,N_T} \in \operatorname*{arg\,min}_{\theta\in\Theta} \widehat{\mathcal{L}}(\theta, k, N, N_T)$, and present the generalization guarantee below.

Theorem 2 .

(Generalization gap) . Suppose $\text{LLM}_\theta$ is Lipschitz-continuous with respect to the model parameter $\theta$. Then, for any $0 < \epsilon < 1/2$, with probability at least $1 - \epsilon$, we have

\[
\mathcal{L}\left(\widehat{\theta}_{k,N,N_T}, k, N\right) - \inf_{\theta\in\Theta} \mathcal{L}(\theta, k, N) \le \widetilde{\mathcal{O}}\left(\sqrt{\frac{d_\theta + \log(1/\epsilon)}{N_T}}\right), \tag{5}
\]

for any $N$ and sufficiently large $k$, where $d_\theta$ is the dimension of the parameter $\theta$.

Through a careful use of Berge's Maximum Theorem (Berge, 1963), we prove that the right-hand side of Equation 5 does not depend on $k$ and $N$, which allows us to take the limit $\lim_{N\to\infty} \lim_{k\to\infty}$ without affecting the generalization bound. Thanks to the uniform convergence of $\mathcal{L}(\theta, k, N)$ (cf. Section D.2), we further obtain the following corollary on the regret guarantee:

Corollary 1 .

(Regret) . Suppose that for any $k \in \mathbb{N}^+$, $h$ and $f(\cdot, k)$ are non-decreasing, and $\log f$ is a supermodular function (i.e., $\log f(R_1, k_1) - \log f(R_1, k_2) \ge \log f(R_2, k_1) - \log f(R_2, k_2)$ for $R_1 \ge R_2$ and $k_1 \ge k_2$); note that these conditions on $h, f$ are in addition to those specified after Equation 3. Then, with high probability, we have

\[
h\left(\lim_{N\to\infty}\lim_{k\to\infty} \max_{\|\ell_t\|_\infty \le B} \text{Regret}_{\text{LLM}_{\widehat{\theta}_{k,N,N_T}}}\left((\ell_t)_{t\in[T]}\right)\right) \le h\left(\inf_{\theta\in\Theta} \max_{\|\ell_t\|_\infty \le B} \text{Regret}_{\text{LLM}_\theta}\left((\ell_t)_{t\in[T]}\right)\right) + \widetilde{\mathcal{O}}\left(\sqrt{\frac{d_\theta}{N_T}}\right). \tag{6}
\]
Corollary 2 (Emerging behavior: coarse correlated equilibrium).

For a sufficiently large $N_{T}$, if each agent in the matrix game plays according to $\text{LLM}_{\widehat{\theta}_{k,N,N_{T}}}$, then the time-averaged policy of each agent constitutes an approximate coarse correlated equilibrium of the game.

Proofs of Theorem 2 and Corollary 1 are deferred to Section D.3, and Corollary 2 follows from the folklore result discussed in Remark 1. Therefore, if, in addition, the LLM parameterization (i.e., Transformers) can realize a no-regret algorithm (for example, the single-layer self-attention model can construct FTRL, as shown next in Section 5.3), then Corollary 1 implies that with a large enough number of samples $N_{T}$, the learned $\text{LLM}_{\widehat{\theta}_{k,N,N_{T}}}$ becomes a no-regret learner, i.e., $\text{Regret}_{\text{LLM}_{\widehat{\theta}_{k,N,N_{T}}}}\left((\ell_{t})_{t\in[T]}\right)=o(T)$, since the first term on the right-hand side of Equation 6 is then directly $o(T)$ under the choice of $h(x)=\max\{0,x\}$. For other choices of $h$, one can use the inverse function $h^{-1}$ (which always exists by our requirement on $h$) to ensure that $\text{Regret}_{\text{LLM}_{\widehat{\theta}_{k,N,N_{T}}}}((\ell_{t})_{t\in[T]})$ is of order $o(T)$.

Despite the power and generality of the previous results, one cannot use an infinitely large $N$ and $k$ in practical training. Hence, in the next subsection, we provide results for finite $N$, under the specific parameterization of LLMs by Transformers.

5.3 Minimizing Regret-Loss Can Automatically Produce Known Online Learning Algorithms

We now study the setting of minimizing Equation 3 when $\text{LLM}_{\theta}$ is specifically parameterized by Transformers. As an initial step, we focus on single-layer (linear) self-attention models, as in most recent theoretical studies of Transformers (Ahn et al., 2023; Zhang et al., 2023a; Mahankali et al., 2023), and the more practical setting with a finite $N=1$. Note that in this case, the choice of $f$ (and thus $k$) is not relevant. Thus, throughout this section, we drop the superscript $(j)$ in Equation 3 for notational convenience. We sample $\ell_{t}$ for $t\in[T]$ as realizations of some random variable $Z$. Here, we assume $Z$ is symmetric about zero (i.e., $Z\overset{d}{=}-Z$) and that $\text{Var}(Z)=\Sigma$ is positive definite. We also assume that the interior of the support of $Z$ contains a ball centered at $\boldsymbol{0}_{d}$.

First, we consider the following structure of a single-layer self-attention model $g$ (see a formal introduction in Section A.1):

$$g(Z_{t};V,K,Q,v_{c},k_{c},q_{c}):=(V\ell_{1:t}+v_{c}\boldsymbol{1}_{t}^{\intercal})\,\texttt{Softmax}\left((K\ell_{1:t}+k_{c}\boldsymbol{1}_{t}^{\intercal})^{\intercal}(Qc+q_{c})\right), \quad (7)$$

where $Z_{t}=(\ell_{1},\dots,\ell_{t},c)$, $\ell_{1:t}\in\mathbb{R}^{d\times t}$ denotes the matrix whose $i$-th column is $\ell_{i}$, $V,K,Q\in\mathbb{R}^{d\times d}$ correspond to the value, key, and query matrices, respectively, $v_{c},k_{c},q_{c}\in\mathbb{R}^{d}$ correspond to the bias terms associated with $V,K,Q$, and $c\neq\boldsymbol{0}_{d}$ is a constant vector. We then have the following result.

Theorem 3.

Consider the policy space $\Pi=B(0,R_{\Pi},\|\cdot\|)$ for some $R_{\Pi}>0$. A configuration $(V,K,Q,v_{c},k_{c},q_{c})$ of the single-layer self-attention model in Equation 7 such that $K^{\intercal}(Qc+q_{c})=v_{c}=\boldsymbol{0}_{d}$ and

$$V=-R_{\Pi}\frac{T}{\sum_{t=1}^{T-1}1/t}\,\Sigma^{-1}\,\mathbb{E}\left[\left\|\sum_{t=1}^{T}\ell_{t}\right\|\ell_{1}\ell_{2}^{\intercal}\right]\Sigma^{-1}$$

is a first-order stationary point of Equation 3 with $N=1$ and $h(x)=x^{2}$. Moreover, if $\Sigma$ is a diagonal matrix, then plugging this configuration into Equation 7 and projecting the output with $\texttt{Proj}_{\Pi,\|\cdot\|}$ performs FTRL with an $L_{2}$-regularizer for the loss vectors $(\ell_{t})_{t\in[T]}$.
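To make the construction concrete, the model in Equation 7 can be sketched in a few lines of NumPy. This is a minimal illustration of ours (the dimensions and random inputs are arbitrary, and the function names are not from any released codebase). Note that under a configuration with $K^{\intercal}(Qc+q_{c})=\boldsymbol{0}_{d}$, as in Theorem 3, the softmax weights become uniform, so the output reduces to the column average of $V\ell_{1:t}$:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax of a vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def g(losses, V, K, Q, v_c, k_c, q_c, c):
    """Single-layer self-attention model of Equation 7.

    losses: d x t matrix whose columns are the loss vectors ell_1, ..., ell_t.
    Returns the d-dimensional (unprojected) output.
    """
    d, t = losses.shape
    ones = np.ones((1, t))
    values = V @ losses + v_c[:, None] @ ones   # d x t value vectors
    keys = K @ losses + k_c[:, None] @ ones     # d x t key vectors
    attn = softmax(keys.T @ (Q @ c + q_c))      # t attention weights
    return values @ attn

rng = np.random.default_rng(0)
d, t = 3, 5
losses = rng.normal(size=(d, t))
V = rng.normal(size=(d, d))
out = g(losses, V, np.zeros((d, d)), np.zeros((d, d)),
        np.zeros(d), np.zeros(d), np.zeros(d), np.ones(d))
# With K = 0 (so K^T (Qc + q_c) = 0), the attention is uniform:
print(np.allclose(out, (V @ losses).mean(axis=1)))  # True
```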

In practical training, such stationary points of the loss may be attained by first-order optimization algorithms such as (stochastic) gradient descent, the workhorse of machine learning. Moreover, we also consider the following single-layer linear self-attention model, for which we can strengthen the result above from a stationary-point to a global-optimality argument:

$$g(Z_{t};V,K,Q,v_{c},k_{c},q_{c})=\sum_{i=1}^{t}(V\ell_{i}+v_{c})\left((K\ell_{i}+k_{c})^{\intercal}(Qc+q_{c})\right). \quad (8)$$
Theorem 4.

Consider the policy space $\Pi=B(0,R_{\Pi},\|\cdot\|)$ for some $R_{\Pi}>0$. A configuration $(V,K,Q,v_{c},k_{c},q_{c})$ of the single-layer linear self-attention model in Equation 8 such that $K^{\intercal}(Qc+q_{c})=v_{c}=\boldsymbol{0}_{d}$ and $V=-2R_{\Pi}\Sigma^{-1}\mathbb{E}\left[\left\|\sum_{t=1}^{T}\ell_{t}\right\|\ell_{1}\ell_{2}^{\intercal}\right]\Sigma^{-1}$ is a global optimal solution of Equation 3 with $N=1$ and $h(x)=x^{2}$. Moreover, every global optimal configuration of Equation 3 within the parameterization class of Equation 8 has the same output function $g$. Additionally, if $\Sigma$ is a diagonal matrix, then plugging any global optimal configuration into Equation 8 and projecting the output with $\texttt{Proj}_{\Pi,\|\cdot\|}$ performs FTRL with an $L_{2}$-regularizer for the loss vectors $(\ell_{t})_{t\in[T]}$.

Theorem 4 shows the capacity of self-attention Transformer models to realize online learning algorithms, thanks to the regret-loss we proposed. In particular, this can be achieved automatically by optimizing the new loss, without hard-coding the parameters of the Transformer.

The above results are for FTRL with an $L_{2}$-regularizer; it is also possible to consider FTRL with an entropy regularizer, leading to the well-known Hedge algorithm (Freund and Schapire, 1997), which is more compatible with the simplex constraint on $\pi$ in the Experts Problem. We defer the discussion of this case to Section D.7. Through these results, we can also guarantee, in the repeated game setting, that approximate coarse correlated equilibria emerge in the long run, since each player exhibits no-regret behavior, by a similar argument as that for Corollary 2.

Remark 2.

The very recent studies by Ahn et al. (2023); Zhang et al. (2023a); Mahankali et al. (2023) have demonstrated that if $Z_{t}=((x_{1},y_{1}),\dots,(x_{t},y_{t}),(x_{t+1},0))$ and the “instruction tuning” loss (i.e., $\mathbb{E}[\|\widehat{y}_{t+1}-y_{t+1}\|^{2}]$) is minimized over single-layer linear self-attention models, then a global optimizer yields the output $\widehat{y}_{n+1}=\eta\sum_{i=1}^{n}y_{i}x_{i}^{\intercal}x_{n+1}$. This output can be interpreted as a step of gradient descent, indicating that a single-layer linear self-attention model implicitly performs gradient descent. However, in the online learning setting, where there are no $y$-labels, such an implicit gradient-descent update rule is hard to define. In contrast to these previous studies, our global optimizer among single-layer linear self-attention models performs an explicit online gradient descent update for online learning. With a different loss (regret-loss v.s. instruction-tuning loss), the techniques used to obtain these seemingly similar results are also fundamentally different.

5.4 Experimental Results for Minimizing Regret-Loss

We now provide experimental results for minimizing our regret-loss, evaluated in the following environments: 1) randomly generated loss sequences (Figure 9); 2) loss sequences with a predictable trend (Figure 10); 3) repeated games (Figure 11); and 4) the counterexamples on which pre-trained LLMs were shown to have regrettable performance (Figure 7). Details of the training setup can be found in Section D.8. We also provide an ablation study for the training of the loss in Equation 3 in Section D.9.

Randomly generated loss sequences.

We use the same loss vectors as those in Section 3.2 for randomly generated loss functions, and compare the results with those of GPT-4. The results, shown in Figure 9, demonstrate that with regret-loss, both the trained single-layer self-attention model and the trained Transformers with multi-layer self-attention structures can achieve regret comparable to that of FTRL and GPT-4.

Figure 9: Regret performance for randomly generated loss sequences drawn from a truncated Gaussian and a uniform distribution. No-regret behaviors of single-layer and multi-layer self-attention models are validated by both of our frameworks (low $p$-values and $\widehat{\beta}_{0}<1$).
Figure 10: Regret performance for loss sequences generated with a linear trend and a sine trend. No-regret behaviors of single-layer and multi-layer self-attention models are validated by both of our frameworks (low $p$-values and $\widehat{\beta}_{0}<1$).
Loss sequences with a predictable trend.

We investigate the case where the loss sequences follow predictable trends, such as a linear or a sine trend. One might expect the trained Transformer to surpass traditional no-regret learning algorithms such as FTRL, since FTRL may not be optimal for loss sequences with a predictable trend. We modify the training distribution by changing the distribution of the random variable $Z$ (which generates the loss vectors $\ell_{t}$) to follow two kinds of trends: linear and sine functions. The results, illustrated in Figure 10, show that for the linear trend, both the trained single-layer self-attention model and the trained Transformer with multi-layer self-attention structures trained with regret-loss outperform GPT-4 and FTRL in terms of regret. Similarly, for the sine trend, the trained Transformer with multi-layer self-attention structures is comparable to GPT-4 and outperforms FTRL. Note that the training dataset does not contain the evaluated loss sequences. Nonetheless, by capturing the overall trend during training, the trained models attain performance superior to or on par with that of FTRL and GPT-4.

Repeated games.

We then investigate the case of multi-player repeated games. We study two-player $2\times 2$ games, three-player $3\times 3\times 3$ games, and four-player $3\times 3\times 3\times 3$ games, where each entry of the payoff matrix is sampled randomly from $\operatorname{Unif}([0,10])$. The results, illustrated in Figure 11, show that the trained single-layer self-attention model and the trained Transformer with multi-layer self-attention structures trained with regret-loss perform similarly to FTRL. However, GPT-4 still outperforms both trained models in terms of regret. One possible reason is that in repeated games (where the environment faced by the agent can be less adversarial than in the online setting), there might be algorithms better than FTRL (see, e.g., Daskalakis et al. (2021)), while our self-attention models have a structure similar to FTRL (Theorem 3 or Theorem 4). Also, in practical training (with the empirical loss in Equation 4), we may not have found the exact global minimum or stationary point of the expected loss in Equation 3. Hence, it is possible for GPT-4 to have lower regret than our models trained with the regret-loss.

Figure 11: Regret performance for general-sum games with two, three, and four players. No-regret behaviors of single-layer and multi-layer self-attention models are validated by both of our frameworks (low $p$-values and $\widehat{\beta}_{0}<1$).
Two scenarios that caused regrettable behaviors of GPT-4.

Finally, we investigate the two scenarios that caused GPT-4 to have regrettable performance in Section 3.2. The results, shown in Figure 7, demonstrate that both the trained single-layer self-attention model and the trained Transformer with regret-loss achieve no-regret performance comparable to that of FTRL, and outperform GPT-4. This validates that our new unsupervised training loss can address the regrettable cases, as our theory in Sections 5.2 and 5.3 has predicted.

6 Concluding Remarks

In this paper, we studied the online decision-making and strategic behaviors of LLMs quantitatively, through the metric of regret. We first examined and validated the no-regret behavior of several representative pre-trained LLMs in benchmark settings of online learning and games. As a consequence, (coarse correlated) equilibria can oftentimes emerge as the long-term outcome of multiple LLMs playing repeated games. We then provided some theoretical insights into the no-regret behavior, by connecting pre-trained LLMs to the follow-the-perturbed-leader algorithm in online learning, under certain assumptions. We also identified (simple) cases where pre-trained LLMs fail to be no-regret, and thus proposed a new unsupervised training loss, regret-loss, to provably promote the no-regret behavior of Transformers without the labels of (optimal) actions. We established both experimental and theoretical evidence for the effectiveness of our regret-loss.

As a first attempt toward rigorously understanding the online and strategic decision-making behaviors of LLMs through the metric of regret, our work has opened up fruitful directions for future research:

  • There is more than one definition of (dynamic) regret in the online learning literature, and we mainly focused on the so-called external regret. It would be interesting to study the no-regret behavior of LLMs in terms of other regret metrics, e.g., swap regret (Blum and Mansour, 2007), which may lead to stronger equilibrium notions in playing repeated games.

  • Our new regret-loss has exhibited promise in our experiments for training modest-scale Transformers. We are currently generalizing it to the training of larger-scale models, such as foundation models, for decision-making.

  • No-regret behavior can sometimes lead to better outcomes in terms of social efficiency (Blum et al., 2008; Roughgarden, 2015; Nekipelov et al., 2015). It would thus be interesting to further validate the efficiency of no-regret LLM agents in these scenarios, and to identify new prompts and training losses that promote the efficiency of the outcomes.

  • To evaluate the performance quantitatively, we focused on online learning and games with numeric-valued payoffs. It would be interesting to connect our no-regret-based and game-theoretic framework with existing multi-LLM frameworks, e.g., debate, collaborative problem-solving, and human/social behavior simulation, with potentially new notions of regret (defined in different spaces) as performance metrics.

Acknowledgement

The authors thank Dingwen Kong for discussing the truncation idea in proving Lemma 8 . Also, the authors thank Kristian Georgiev and Aleksander Madry for the helpful feedback.

References

Supplementary Materials for

“Do LLM Agents Have Regret? A Case Study in Online Learning and Games”

Appendix A Deferred Background

A.1 Additional Definitions for Appendix

(Linear) Self-attention.

One key component of Transformers (Vaswani et al., 2017), the backbone of modern language models, is the (self-)attention mechanism. For simplicity, we here focus on introducing the single-layer self-attention architecture. The mechanism takes a sequence of vectors $Z=[z_{1},\dots,z_{t}]\in\mathbb{R}^{d\times t}$ as input, and outputs a sequence $[\widehat{z}_{1},\dots,\widehat{z}_{t}]\in\mathbb{R}^{d\times t}$. For each $i\in[t]$ with $i>1$, the output is generated by $\widehat{z}_{i}=(Vz_{1:i-1})\sigma((Kz_{1:i-1})^{\intercal}(Qz_{i}))$, where $z_{1:i-1}$ denotes the first $i-1$ columns of $Z$, $\sigma$ is either the Softmax or ReLU activation function, and the initial output is $\widehat{z}_{1}=\boldsymbol{0}_{d}$. Here, $V,Q,K\in\mathbb{R}^{d\times d}$ are referred to as the Value, Query, and Key matrices, respectively. Following the theoretical framework in Von Oswald et al. (2023); Mahankali et al. (2023), we exclude the attention score of a token $z_{i}$ in relation to itself. For theoretical analysis, we also consider the linear self-attention model, where $\widehat{z}_{i}=(Vz_{1:i-1})((Kz_{1:i-1})^{\intercal}(Qz_{i}))$. We write this (linear) self-attention layer’s output as $\texttt{(L)SA}_{(V,Q,K)}(Z)$.
We define an $M$-head self-attention layer with $\theta=\{(V_{m},Q_{m},K_{m})\}_{m\in[M]}$ as $\texttt{M-(L)SA}_{\theta}(Z):=\sum_{m=1}^{M}\texttt{(L)SA}_{(V_{m},Q_{m},K_{m})}(Z)$, and define the norm $\|\theta\|_{\texttt{M-(L)SA}}:=\max_{m\in[M]}\left\{\|Q_{m}\|_{\texttt{op}},\|K_{m}\|_{\texttt{op}}\right\}+\sum_{m=1}^{M}\|V_{m}\|_{\texttt{op}}$.
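As a concrete reference for the definitions above, the (linear) self-attention layer with the self-token excluded, and its $M$-head sum, can be sketched as follows. This is our own illustrative NumPy code, not from the paper; the function names are ours.

```python
import numpy as np

def lsa(Z, V, Q, K, linear=True):
    """(Linear) self-attention excluding the self-token, as defined in A.1.

    Z: d x t input; output column i attends only to columns 1..i-1,
    and the first output column is the zero vector.
    """
    d, t = Z.shape
    out = np.zeros((d, t))
    for i in range(1, t):
        past = Z[:, :i]                         # z_{1:i-1}
        scores = (K @ past).T @ (Q @ Z[:, i])   # attention scores, shape (i,)
        if not linear:
            scores = np.exp(scores - scores.max())
            scores /= scores.sum()              # Softmax activation
        out[:, i] = (V @ past) @ scores
    return out

def m_lsa(Z, heads, linear=True):
    # M-head layer: the sum of the per-head outputs
    return sum(lsa(Z, V, Q, K, linear) for (V, Q, K) in heads)
```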

Transformers.

A multi-layer perceptron (MLP) layer takes $Z=[z_{1},\dots,z_{t}]\in\mathbb{R}^{d\times t}$ as input, with parameter $\theta=(W_{1},W_{2})\in\mathbb{R}^{d'\times d}\times\mathbb{R}^{d\times d'}$, such that for each $i\in[t]$, the output is $\widehat{z_{i}}:=W_{2}\sigma(W_{1}z_{i})$, where $\sigma$ is either Softmax or ReLU. We write the output of an MLP layer with parameter $\theta$ as $\texttt{MLP}_{\theta}(Z)$. Defining $\|\theta\|_{\texttt{MLP}}:=\|W_{1}\|_{\texttt{op}}+\|W_{2}\|_{\texttt{op}}$ and $\texttt{ResNet}(f,Z):=Z+f(Z)$, we can define an $L$-layer Transformer with parameter $\theta=(\theta^{(lm)},\theta^{(la)})_{l\in[L]}$ as

$$\texttt{TF}_{\theta}(Z):=Z^{(L)},$$

where the output $Z^{(L)}$ is defined iteratively from $Z^{(0)}=\texttt{clip}_{R}(Z):=\min(R,\max(-R,Z))$ (applied entrywise) and

$$Z^{(l)}=\texttt{clip}_{R}\left(\texttt{ResNet}\left(\texttt{MLP}_{\theta^{(la)}},\texttt{ResNet}\left(\texttt{M-(L)SA}_{\theta^{(lm)}},Z^{(l-1)}\right)\right)\right),$$

for some $R>0$. We define a class of Transformers with certain parameters as $\Theta_{d,L,M,d',B_{\texttt{TF}}}:=\{\theta=(\theta^{(lm)},\theta^{(la)})_{l\in[L],m\in[M]}:\|\theta\|_{\texttt{TF}}\leq B_{\texttt{TF}}\}$, where $M$ is the number of heads of self-attention,

$$\|\theta\|_{\texttt{TF}}:=\max_{l\in[L]}\left\{\|\theta^{(lm)}\|_{\texttt{M-(L)SA}}+\|\theta^{(la)}\|_{\texttt{MLP}}\right\}, \quad (9)$$

and $B_{\texttt{TF}}>0$ is some constant. When it is clear from the context, we may omit the subscripts and write it as $\Theta$ for simplicity. We assume $R$ to be sufficiently large so that clip does not take effect in any of our approximation results.

A.2 In-Context Learning

In-context learning is an emergent behavior of LLMs (Brown et al., 2020), whereby these models can adapt to and learn from a limited number of examples provided within their immediate input context. In in-context learning, the prompt typically consists of $T$ in-context (independent) examples $(x_{t},y_{t})_{t\in[T]}$ and the $(T+1)$-th input $x_{T+1}$, so that $\text{LLM}((z_{t})_{t\in[T]},x_{T+1})$ provides an inference of $y_{T+1}$, where $z_{t}=(x_{t},y_{t})$.

A.3 Online Learning Algorithms

Follow-the-regularized-leader (FTRL).

The Follow-the-Regularized-Leader (FTRL) algorithm (Shalev-Shwartz, 2007) is an iterative method that updates the policy based on the observed data and a regularization term. The idea is to choose the next policy that minimizes the sum of the past losses and a regularization term.

Mathematically, given a sequence of loss vectors $\ell_{1},\ell_{2},\dots,\ell_{t}$, the FTRL algorithm updates the policy $\pi$ at each time step $t$ as follows:

$$\pi_{t+1}=\operatorname*{arg\,min}_{\pi\in\Pi}\left(\sum_{i=1}^{t}\langle\ell_{i},\pi\rangle+R(\pi)\right),$$

where $R(\pi)$ is a regularization term, introduced to prevent overfitting; it can be any function that penalizes the complexity of the policy. A function $R(\pi)$ is said to be $\lambda$-strongly convex with respect to a norm $\|\cdot\|$ if for all $\pi,\pi'\in\Pi$:

$$R(\pi)\geq R(\pi')+\langle\nabla R(\pi'),\pi-\pi'\rangle+\frac{\lambda}{2}\|\pi-\pi'\|^{2}.$$

A key property that ensures the convergence and stability of the FTRL algorithm is the strong convexity of the regularization term $R(\pi)$, which guarantees that the optimization problem in FTRL has a unique solution. The flexibility of the FTRL framework allows it to encompass a wide range of online learning algorithms, from gradient-based methods like online gradient descent to decision-making algorithms like Hedge (Freund and Schapire, 1997).
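As a minimal sketch of the update above, consider FTRL over an $L_{2}$ ball with the regularizer $R(\pi)=\frac{1}{2\eta}\|\pi\|_{2}^{2}$; the minimizer then has a closed form. The choice of $\Pi$, the radius, and the step size $\eta$ are illustrative assumptions of ours:

```python
import numpy as np

def ftrl_l2(losses, radius=1.0, eta=0.1):
    """FTRL with R(pi) = ||pi||^2 / (2 * eta) over the L2 ball of given radius.

    The unconstrained minimizer of sum_i <ell_i, pi> + R(pi) is
    -eta * sum_i ell_i; restricting to the ball amounts to an L2 projection.
    losses: T x d array; returns the T x d array of policies pi_1, ..., pi_T.
    """
    T, d = losses.shape
    cum = np.zeros(d)
    policies = np.zeros((T, d))
    for t in range(T):
        point = -eta * cum                    # unconstrained FTRL solution
        norm = np.linalg.norm(point)
        policies[t] = point if norm <= radius else point * (radius / norm)
        cum += losses[t]                      # the loss is observed after acting
    return policies
```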

Connection to online gradient descent (OGD).

The Online Gradient Descent (OGD) algorithm (Cesa-Bianchi et al., 1996) is a special case of the FTRL algorithm in which the regularization term is the squared $L_{2}$-norm, i.e., $R(\pi)=\frac{1}{2}\|\pi\|_{2}^{2}$, and $\Pi=\mathbb{R}^{d}$. In OGD, at each time step $t$, the policy $\pi$ is updated using the gradient of the loss function:

$$\pi_{t+1}=\pi_{t}-\ell_{t}.$$

Therefore, the connection between FTRL and OGD can be seen by observing that the OGD update rule can be derived from the FTRL update with $L_{2}$-regularization: solving the FTRL objective in closed form gives $\pi_{t+1}=-\sum_{i=1}^{t}\ell_{i}=\pi_{t}-\ell_{t}$.
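This equivalence can be checked numerically; the following sketch (ours) confirms that the closed-form FTRL minimizer with the squared $L_{2}$-regularizer coincides with iterating the OGD update from the zero vector:

```python
import numpy as np

rng = np.random.default_rng(1)
losses = rng.normal(size=(50, 4))  # T = 50 loss vectors in R^4

# FTRL with R(pi) = 0.5 * ||pi||_2^2 over R^d has the closed-form
# minimizer pi_{t+1} = -(ell_1 + ... + ell_t).
ftrl = [-losses[:t].sum(axis=0) for t in range(1, 51)]

# OGD: pi_{t+1} = pi_t - ell_t, starting from the zero vector.
pi, ogd = np.zeros(4), []
for ell in losses:
    pi = pi - ell
    ogd.append(pi.copy())

print(np.allclose(ftrl, ogd))  # True
```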

Connection to the Hedge algorithm.

The Hedge algorithm (Freund and Schapire, 1997) (also referred to as the Multiplicative Weight Update algorithm (Arora et al., 2012)) is an online learning algorithm designed for problems where the learner has to choose from a set of actions (denoted as $\mathcal{A}$) at each time step and suffers a loss based on the chosen action. The FTRL framework can be used to derive the Hedge algorithm by considering an entropy regularization term. Specifically, with the negative-entropy regularizer $R(\pi)=\sum_{j\in[d]}\pi_{j}\log\pi_{j}$ (where $d$ is the dimension of the policy $\pi$), the FTRL update rule yields the Hedge algorithm as

$$\pi_{(t+1)j}=\frac{\pi_{tj}\exp(-\ell_{tj})}{\sum_{i\in[d]}\pi_{ti}\exp(-\ell_{ti})}$$

for $j\in[d]$.
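In implementation, Hedge is only a few lines; the sketch below (ours) uses an explicit learning rate $\eta$, a standard generalization of the unit-rate update above:

```python
import numpy as np

def hedge(losses, eta=0.5):
    """Hedge / multiplicative weights: pi_{t+1, j} is proportional to
    pi_{t, j} * exp(-eta * ell_{t, j}).

    losses: T x d array of per-action losses; returns the T x d policies played.
    """
    T, d = losses.shape
    w = np.ones(d)                      # uniform initial weights
    policies = np.zeros((T, d))
    for t in range(T):
        policies[t] = w / w.sum()       # play the normalized weights
        w *= np.exp(-eta * losses[t])   # multiplicative update
    return policies
```

Equivalently, the FTRL view gives $\pi_{(t+1)j}\propto\exp(-\eta\sum_{i\leq t}\ell_{ij})$, the same iterates in closed form.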

Follow-the-perturbed-leader (FTPL).

Given a sequence of loss vectors $\ell_{1},\ell_{2},\dots,\ell_{t-1}$, the Follow-the-Perturbed-Leader (FTPL) algorithm (Kalai and Vempala, 2005) updates the policy $\pi$ at each time step $t$ by incorporating a perturbation vector $\epsilon_{t}$, sampled from a pre-defined distribution. The policy $\pi_{t}$ for the next time step is chosen by solving the following optimization problem:

$$\pi_{t}=\mathbb{E}\left[\operatorname*{arg\,min}_{\pi\in\Pi}\,\langle\epsilon_{t},\pi\rangle+\sum_{i=1}^{t-1}\langle\ell_{i},\pi\rangle\right]. \quad (10)$$

Here, $\epsilon_{t}$ introduces randomness into the decision-making.
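Since the expectation in Equation 10 is rarely available in closed form, it can be estimated by sampling. The sketch below (ours) takes $\Pi$ to be the simplex, where the inner $\arg\min$ puts all mass on a single action, and uses a Gumbel perturbation as one common illustrative choice:

```python
import numpy as np

def ftpl_policy(past_losses, n_samples=1000, rng=None):
    """Monte-Carlo estimate of the FTPL policy in Equation 10 over the simplex.

    Over the simplex, the inner arg min concentrates on the action with the
    smallest perturbed cumulative loss, so the expectation equals the
    probability of each action being that minimizer.
    """
    rng = rng or np.random.default_rng(0)
    d = past_losses.shape[1]
    cum = past_losses.sum(axis=0)       # cumulative loss of each action
    counts = np.zeros(d)
    for _ in range(n_samples):
        eps = rng.gumbel(size=d)        # perturbation vector epsilon_t
        counts[np.argmin(cum + eps)] += 1
    return counts / n_samples
```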

Relationship between FTRL and FTPL.

The FTRL and FTPL algorithms are deeply related. For example, FTPL with Gumbel-distributed perturbations is equivalent to FTRL with entropy regularization (i.e., Hedge). In general, for the FTPL algorithm with any perturbation distribution, one can always find an FTRL algorithm with a particular regularizer such that their update rules are equivalent. However, the converse does not hold. For example, Hofbauer and Sandholm (2002) show that for FTRL with log-barrier regularization, there does not exist an equivalent perturbation distribution for FTPL.

Restarting techniques for non-stationary online learning.

For non-stationary online learning problems, one common technique is restarting: one restarts the standard online learning algorithm periodically (Besbes et al., 2014) (see also, e.g., Wei and Luo (2021); Mao et al. (2020)). After each restart, the algorithm ignores the previous history and executes as if it were at the beginning of the interaction with the environment. Since the variation of the loss sequences is bounded, the loss sequence between two consecutive restarts can be regarded as almost stationary, which makes an overall sublinear dynamic-regret guarantee achievable.
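The restarting idea can be wrapped around any base algorithm; the sketch below (ours) applies it to Hedge-style multiplicative weights, with the epoch length a hypothetical tuning parameter:

```python
import numpy as np

def restarting_hedge(losses, epoch_len=20, eta=0.5):
    """Hedge with periodic restarts for non-stationary losses.

    Every epoch_len rounds, the weights are reset to uniform, so each epoch
    runs the base algorithm as if the interaction had just begun.
    losses: T x d array; returns the T x d policies played.
    """
    T, d = losses.shape
    policies = np.zeros((T, d))
    w = np.ones(d)
    for t in range(T):
        if t % epoch_len == 0:
            w = np.ones(d)              # restart: forget all history
        policies[t] = w / w.sum()
        w *= np.exp(-eta * losses[t])
    return policies
```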

A.4 Why Focusing on Linear Loss Function?

We note that focusing on the linear loss function $f_{t}(\pi):=\langle\ell_{t},\pi\rangle$ does not lose much generality. Specifically, for general convex loss functions $(f_{t})_{t\in[T]}$, we have $f_{t}(\pi_{\mathscr{A},t})-f_{t}(\pi)\leq\langle\nabla f_{t}(\pi_{\mathscr{A},t}),\pi_{\mathscr{A},t}-\pi\rangle$ for any $\pi\in\Pi$, which indicates

$$\text{Regret}_{\mathscr{A}}\left((f_t)_{t\in[T]}\right)\leq\sum_{t=1}^{T}\mathbb{E}[\langle\nabla f_t(\pi_{\mathscr{A},t}),\pi_{\mathscr{A},t}\rangle]-\inf_{\pi\in\Pi}\sum_{t=1}^{T}\mathbb{E}[\langle\nabla f_t(\pi_{\mathscr{A},t}),\pi\rangle].$$

Therefore, one can regard the loss vectors $(\ell_t)_{t\in[T]}$ as $\ell_t:=\nabla f_t(\pi_{\mathscr{A},t})$ for $t\in[T]$, and control the actual regret by studying the linear loss function (Hazan, 2016). The same argument for general convex $f_t$ applies to the dynamic-regret metric as well. In sum, an algorithm designed for online linear optimization can be adapted to solve online convex optimization, with the understanding that the loss vector received at round $t$ corresponds to the gradient of the convex function evaluated at the policy at that round.
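This reduction can be illustrated in code: we run Hedge, an online linear optimizer, but feed it $\ell_t=\nabla f_t(\pi_t)$ for a convex loss (here $f_t(\pi)=\|\pi-y_t\|^2$ with randomly drawn targets $y_t$, an assumed example), and check that the convex regret is indeed bounded by the linearized regret:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, eta = 500, 3, 0.1
targets = rng.dirichlet(np.ones(d), size=T)     # y_t: illustrative data in the simplex

cum = np.zeros(d)                               # cumulative linearized losses
policies, grads = [], []
for t in range(T):
    w = np.exp(-eta * (cum - cum.min()))        # Hedge step on the linearized losses
    pi = w / w.sum()
    # f_t(pi) = ||pi - y_t||^2 is convex; hand its gradient at pi_t to the
    # linear learner as the loss vector ell_t
    g = 2.0 * (pi - targets[t])
    policies.append(pi)
    grads.append(g)
    cum += g

policies, grads = np.array(policies), np.array(grads)
pi_star = targets.mean(axis=0)                  # best fixed policy (lies in the simplex)
convex_regret = ((policies - targets) ** 2).sum() - ((pi_star - targets) ** 2).sum()
linear_regret = (grads * policies).sum() - grads.sum(axis=0) @ pi_star

print(f"convex regret {convex_regret:.3f} <= linearized regret {linear_regret:.3f}")
assert convex_regret <= linear_regret + 1e-9
```

The final assertion is exactly the displayed inequality, summed over rounds with the comparator $\pi^\star$.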

A.5 Six Representative General-Sum Games

In game theory, there are six representative two-player general-sum games (Robinson and Goforth, 2005). First, consider the win-win game, represented by the matrices $A=\begin{pmatrix}1&4\\ 1&2\end{pmatrix}$ and $B=\begin{pmatrix}1&4\\ 1&2\end{pmatrix}$ for players A and B, respectively. This setup fosters a cooperative dynamic, as both players receive identical payoffs, encouraging strategies that benefit both parties equally.

In contrast, the prisoner’s dilemma , depicted by the payoff matrices $A=\begin{pmatrix}1&3\\ 2&4\end{pmatrix}$ and $B=\begin{pmatrix}4&3\\ 2&1\end{pmatrix}$, illustrates the conflict between individual and collective rationality, where players are tempted to pursue individual gain at the collective’s expense, often resulting in suboptimal outcomes for both.

In the unfair game , represented by $A=\begin{pmatrix}2&1\\ 3&4\end{pmatrix}$ and $B=\begin{pmatrix}4&3\\ 1&2\end{pmatrix}$, the asymmetry in the payoff structure places one player at a disadvantage, regardless of the chosen strategy. This imbalance often reflects real-world scenarios where power or information asymmetry affects decision-making.

The cyclic game , with matrices $A=\begin{pmatrix}3&1\\ 2&4\end{pmatrix}$ and $B=\begin{pmatrix}3&4\\ 2&1\end{pmatrix}$, presents a scenario where no pure-strategy equilibrium exists. The best strategy for each player changes in response to the other’s actions, leading to a continuous cycle of strategy adaptation without a clear resolution.

The biased game , depicted by $A=\begin{pmatrix}3&2\\ 1&4\end{pmatrix}$ and $B=\begin{pmatrix}4&2\\ 1&3\end{pmatrix}$, inherently favors one player, often reflecting situations where external factors or inherent advantages influence outcomes, leading to consistently unequal payoffs.

Finally, the second-best game , with payoff matrices $A=\begin{pmatrix}1&2\\ 3&4\end{pmatrix}$ and $B=\begin{pmatrix}1&4\\ 3&2\end{pmatrix}$, encapsulates scenarios where players settle for less-than-optimal outcomes due to constraints like risk aversion or limited options. This often results in players choosing safer, albeit less rewarding, strategies.

Each of these games exemplifies distinct aspects of strategic decision-making and interactions. From cooperative to competitive and fair to biased scenarios, these matrices provide a rich landscape for exploring the nuances of decision-making behavior in game theory.
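As a programmatic companion to these descriptions, the following sketch collects the six payoff matrices and enumerates pure-strategy Nash equilibria, assuming the convention that player A chooses rows, player B chooses columns, and both maximize their payoffs; in particular, the cyclic game indeed has no pure equilibrium:

```python
import numpy as np

# The six representative 2x2 general-sum games (payoff matrices for players A and B)
games = {
    "win-win":     (np.array([[1, 4], [1, 2]]), np.array([[1, 4], [1, 2]])),
    "prisoner's":  (np.array([[1, 3], [2, 4]]), np.array([[4, 3], [2, 1]])),
    "unfair":      (np.array([[2, 1], [3, 4]]), np.array([[4, 3], [1, 2]])),
    "cyclic":      (np.array([[3, 1], [2, 4]]), np.array([[3, 4], [2, 1]])),
    "biased":      (np.array([[3, 2], [1, 4]]), np.array([[4, 2], [1, 3]])),
    "second-best": (np.array([[1, 2], [3, 4]]), np.array([[1, 4], [3, 2]])),
}

def pure_nash(A, B):
    """Pure-strategy Nash equilibria: cells where each player best-responds."""
    eqs = []
    for i in range(2):
        for j in range(2):
            if A[i, j] == A[:, j].max() and B[i, j] == B[i, :].max():
                eqs.append((i, j))
    return eqs

for name, (A, B) in games.items():
    print(name, pure_nash(A, B))
```

For instance, the prisoner’s dilemma has the single pure equilibrium $(1,0)$ with payoffs $(2,2)$, which is Pareto-dominated by the non-equilibrium cell giving both players $3$, matching the dilemma described above.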

Appendix B Deferred Results and Proofs in Section 3

B.1 Ablation Study on Prompts

Ablation study on online learning.

To systematically understand the effects of our prompt on the final performance of the LLM agent, we create three variants of our prompt and report the regret under each in Figure 12 . Specifically, for Ablation1 , we remove the examples illustrating the game rules. For Ablation2 , we remove the number of iterations. For Ablation3 , we add hints for the LLM’s decision-making, suggesting that it pay attention to the loss history, behave more greedily at the end of an episode, and explain the reason for its decision step by step. The last hint is a popular technique in prompt engineering known as Chain-of-Thought prompting (Wei et al., 2022b ) . Finally, we recall that $d$ is the number of actions in all prompts.

Figure 12 : Ablation study on our prompt design.

Original prompt

You are solving a decision-making problem for 25 rounds.

There are $d$ number of action (which is 0 to $d-1$).

At each round, you need to choose a policy, it specifies your probability to choose each action.

This policy should be $d$-dimensional, and the sum of its components should equal 1.

After that, you will be shown the reward vector for choosing each action.

Remember that this reward vector is decided by the external system and can be potentially

different for different rounds.

It is not decided by what policies you have chosen. The reward vector is also $d$-dimensional.

It represents the reward of choosing action from 0 to $d-1$.

For example, a reward vector of [0.8, 3.2] means reward for action_0 is 0.8

and the reward for action_1 is 3.2.

Then your reward for this round will be calculated according to the reward of each action and

your probability of choosing each action.

For example, if you choose the policy [0.2, 0.8] and get the reward vector [1, 2],

then your expected reward is 0.2*1 + 0.8*2=1.8

Your goal is to maximize your accumulative expected reward.

You can adjust your policy based on the reward vectors for all previous rounds.

You’re required to provide your policy in numeric format.

Your response’s last line should be formatted as ’Policy: [your $d$-dimensional policy]’.

Ablation1 : no examples

You are solving a decision-making problem for 25 rounds.

There are $d$ number of action (which is 0 to $d-1$).

At each round, you need to choose a policy, it specifies your probability to choose each action.

This policy should be $d$-dimensional, and the sum of its components should equal 1.

After that, you will be shown the reward vector for choosing each action.

Remember that this reward vector is decided by the external system and can be potentially

different for different rounds.

It is not decided by what policies you have chosen. The reward vector is also $d$-dimensional.

It represents the reward of choosing action from 0 to $d-1$.

Then your reward for this round will be calculated according to the reward of each action and

your probability of choosing each action.

Your goal is to maximize your accumulative expected reward.

You can adjust your policy based on the reward vectors for all previous rounds.

You’re required to provide your policy in numeric format.

Your response’s last line should be formatted as ’Policy: [your $d$-dimensional policy]’.

Ablation2 : no round information

You are solving a decision-making problem.

There are $d$ number of action (which is 0 to $d-1$).

At each round, you need to choose a policy, it specifies your probability to choose each action.

This policy should be $d$-dimensional, and the sum of its components should equal 1.

After that, you will be shown the reward vector for choosing each action.

Remember that this reward vector is decided by the external system and can be potentially

different for different rounds.

It is not decided by what policies you have chosen. The reward vector is also $d$-dimensional.

It represents the reward of choosing action from 0 to $d-1$.

For example, a reward vector of [0.8, 3.2] means reward for action_0 is 0.8

and the reward for action_1 is 3.2.

Then your reward for this round will be calculated according to the reward of each action and

your probability of choosing each action.

For example, if you choose the policy [0.2, 0.8] and get the reward vector [1, 2],

then your expected reward is 0.2*1 + 0.8*2=1.8

Your goal is to maximize your accumulative expected reward.

You can adjust your policy based on the reward vectors for all previous rounds.

You’re required to provide your policy in numeric format.

Your response’s last line should be formatted as ’Policy: [your $d$-dimensional policy]’.

Ablation3 : adding hints

You are solving a decision-making problem for 25 rounds.

There are $d$ number of action (which is 0 to $d-1$).

At each round, you need to choose a policy, it specifies your probability to choose each action.

This policy should be $d$-dimensional, and the sum of its components should equal 1.

After that, you will be shown the reward vector for choosing each action.

Remember that this reward vector is decided by the external system and can be potentially

different for different rounds.

It is not decided by what policies you have chosen. The reward vector is also $d$-dimensional.

It represents the reward of choosing action from 0 to $d-1$.

For example, a reward vector of [0.8, 3.2] means reward for action_0 is 0.8

and the reward for action_1 is 3.2.

Then your reward for this round will be calculated according to the reward of each action and

your probability of choosing each action.

For example, if you choose the policy [0.2, 0.8] and get the reward vector [1, 2],

then your expected reward is 0.2*1 + 0.8*2=1.8

Your goal is to maximize your accumulative expected reward.

You can adjust your policy based on the reward vectors for all previous rounds.

You’re required to provide your policy in numeric format.

Your response’s last line should be formatted as ’Policy: [your $d$-dimensional policy]’.

Let’s think step by step. Explicitly examining history is important.

Please explain how you chose the policy by guessing what reward you might receive

for each action according to the history.

You should explore for first several rounds and behave greedily for later rounds, for example,

choosing one action with probability more than 0.99.

Please also explain whether you are behaving more greedily and less greedily

by explicitly considering the policy you just used for last round.

We can see in Figure 12 that the performance of the LLM agents is consistent across the different variants of the prompts.

Ablation study on repeated games.

For the game setting, we also investigate whether explicitly informing LLM agents that they are ‘‘playing a repeated matrix game with some other opponents’’ affects performance. To this end, we evaluate three different prompts that tell the LLM agents, in the first line of the prompt, that they are playing a matrix game , solving a multi-armed bandit problem, or solving a general decision-making problem. We show the performance under these three prompts in Figure 13 , where it is seen that LLM agents’ performance on repeated games is consistent across these variants.

Figure 13 : Regret of GPT-4 for repeated games under three different prompt ablations. Its performance is consistent among the three prompts.

B.2 Proof for Proposition 1

Proof.

Under the null hypothesis $H_0$, the probability $p$ that $\text{Regret}_{\mathscr{A}}\left((f_\tau)_{\tau\in[t]}\right)/t-\text{Regret}_{\mathscr{A}}\left((f_\tau)_{\tau\in[t+1]}\right)/(t+1)>0$ is less than $\frac{1}{2}$. Therefore, considering the event $\mathcal{E}(s,T)$, we have

$$\mathbb{P}_{H_0}(\mathcal{E}(s,T))=\sum_{k=s}^{T-1}\binom{T-1}{k}p^{k}(1-p)^{T-1-k}\leq\frac{1}{2^{T-1}}\sum_{k=s}^{T-1}\binom{T-1}{k},$$

where the inequality holds since $p<\frac{1}{2}$ and $s\geq\frac{T-1}{2}$. ∎
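The binomial tail inequality used here can be sanity-checked numerically: for $p<1/2$ and $s\geq (T-1)/2$, the tail probability $\sum_{k=s}^{T-1}\binom{T-1}{k}p^k(1-p)^{T-1-k}$ is at most $2^{-(T-1)}\sum_{k=s}^{T-1}\binom{T-1}{k}$. A quick check over a few illustrative values of $T$, $p$, and $s$ (our own choices):

```python
from math import comb

def binom_tail(n, s, p):
    """P(X >= s) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(s, n + 1))

def bound(n, s):
    """The claimed upper bound: 2^{-n} times the number-of-subsets tail."""
    return sum(comb(n, k) for k in range(s, n + 1)) / 2**n

# Verify the inequality for p < 1/2 and s >= (T-1)/2 on a grid of test values
for T in (11, 26, 51):
    n = T - 1
    for p in (0.1, 0.3, 0.49):
        for s in range((n + 1) // 2, n + 1):   # all s with s >= (T-1)/2
            assert binom_tail(n, s, p) <= bound(n, s) + 1e-12
print("inequality verified on all tested (T, p, s)")
```

The check passes because $p^k(1-p)^{n-k}\leq 2^{-n}$ whenever $k\geq n/2$ and $p\leq 1/2$, which is the pointwise bound behind the displayed inequality.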

B.3 Results of GPT-4 Turbo

While preparing our draft, we became aware of a more advanced version of GPT-4, GPT-4 Turbo. Although GPT-4 has successfully exhibited no-regret behaviors in most problems, we also evaluate and report the performance of this more advanced version here. Specifically, we compare GPT-4 and GPT-4 Turbo in the repeated-game scenario in Figure 14 , where we validate that GPT-4 Turbo performs either comparably with or better than GPT-4.

Figure 14 : Comparisons of GPT-4 Turbo with GPT-4 and GPT-3.5 Turbo. Due to the symmetry of agents in randomly generated games, we report the regret of one agent for ease of presentation. We can see that GPT-4 Turbo performs on par with or better than GPT-4.

B.4 LLM Agents’ Explanation on Their Output Policies

Here we present two examples of LLMs’ explanations of their output policies, showing how they reason in their decision-making.

Output Example 1

Looking at the history of the reward vectors, it is clear to see that the reward for choosing

action 0 is consistently higher than that for choosing action 1. This pattern is evident across

all 5 rounds of the game.

In the first round, the difference in rewards is 3.52, this remained relatively stable with

the difference in the last round being 3.52 as well. It’s indicative that action 0 consistently

brings a higher return than action 1.

Therefore, I’ll lean more towards selecting action 0 in my policy.

However, acting deterministically, that is always choosing action 0, would introduce the risk

of becoming too predictable and letting the opponent exploit this predictability

So, in order to maintain some unpredictability while still capitalizing on the higher average

return of action 0, I’ll set my policy as follows: select action 0 with a probability of 0.8

and action 1 with a probability of 0.2.

This way, the majority of the time I’m choosing the action that gives a higher reward, but I’m

also occasionally choosing the other action to keep the opponent guessing, and avoid

any predictable patterns that can be exploited

So, my policy is as follows: Policy: [0.8, 0.2]

Output Example 2

The total reward from action_1 is 130.92 and the average reward is around 10.07.

On the other hand, the total reward from action_0 is 33.418 and the average reward is around 8.354.

Although action_0 has a less average reward, its highest reward is 26.977,

which is considerably high and shows its potential.

For this round, I want to continue to explore action_0 but also significantly exploit action 1

as it reliably gives higher reward.

So, I’d like to decrease the probability of selecting action_0 compared to last time

but still keep it in the policy as a possible option.

The policy for the previous round was not provided but assuming it was more balanced,

I’m now going to lean more towards action_1.

So, the policy will be: [0.3, 0.7]. Here, I’m behaving more greedily compared to earlier rounds,

as we are choosing the action which has been giving us higher rewards with more probability.

As observed from the outputs, the LLM agent tends to use the history of reward vectors by examining the “sum/average of the accumulated rewards”, and tends to introduce randomization into its decision-making in order to be “less predictable and exploitable”. These are key components for achieving no-regret in online learning and games (Cesa-Bianchi and Lugosi, 2006 ) , explaining the empirical evidence we had in Section 3 .

Appendix C Deferred Results and Proofs in Section 4

C.1 Deferred Proof of Observation 1

Proof of Observation 1.

For a given $(\ell_t)_{t\in[T]}$, the inequality

$$\sum_{t=1}^{T}\int_{\pi_t\in\Pi}P_{\text{LLM}}(\pi_t\mid(\ell_i)_{i\in[t-1]})\langle\ell_t,\pi_t\rangle\, d\pi_t\leq\sum_{t=1}^{T}\int_{\pi_t\in\Pi}\left(P_{\text{data}}(\pi_t\mid(\ell_i)_{i\in[t-1]})+\epsilon\right)\langle\ell_t,\pi_t\rangle\, d\pi_t$$

holds, where we use the conventions $P_{\text{LLM}}(\pi_t\mid(\ell_0)):=P_{\text{LLM}}(\pi_t)$ and $P_{\text{data}}(\pi_t\mid(\ell_0)):=P_{\text{data}}(\pi_t)$. Hence,

\begin{align*}
\text{Regret}_{\text{LLM}}\left((\ell_t)_{t\in[T]}\right)&=\sum_{t=1}^{T}\int_{\pi_t\in\Pi}P_{\text{LLM}}(\pi_t\mid(\ell_i)_{i\in[t-1]})\langle\ell_t,\pi_t\rangle\, d\pi_t-\inf_{\pi\in\Pi}\sum_{t=1}^{T}\langle\ell_t,\pi\rangle\\
&\leq\sum_{t=1}^{T}\int_{\pi_t\in\Pi}\left(P_{\text{data}}(\pi_t\mid(\ell_i)_{i\in[t-1]})+\epsilon\right)\langle\ell_t,\pi_t\rangle\, d\pi_t-\inf_{\pi\in\Pi}\sum_{t=1}^{T}\langle\ell_t,\pi\rangle\\
&=\sum_{t=1}^{T}\int_{\pi_t\in\Pi}P_{\text{data}}(\pi_t\mid(\ell_i)_{i\in[t-1]})\langle\ell_t,\pi_t\rangle\, d\pi_t-\inf_{\pi\in\Pi}\sum_{t=1}^{T}\langle\ell_t,\pi\rangle+\sum_{t=1}^{T}\int_{\pi_t\in\Pi}\langle\ell_t,\epsilon\pi_t\rangle\, d\pi_t\\
&\leq\text{Regret}_{\text{data}}\left((\ell_t)_{t\in[T]}\right)+\epsilon\|\ell\|_p\|\pi\|_q T,
\end{align*}

where $\frac{1}{p}+\frac{1}{q}=1$ and $p,q\geq 1$. Similarly, we can establish the corresponding lower bound for $\text{Regret}_{\text{LLM}}\left((\ell_t)_{t\in[T]}\right)$. To prove the result for the dynamic-regret case, it suffices to replace the term $\inf_{\pi\in\Pi}\sum_{t=1}^{T}\langle\ell_t,\pi\rangle$ in the derivation above with $\sum_{t=1}^{T}\inf_{\pi\in\Pi}\langle\ell_t,\pi\rangle$. ∎

C.2 Deferred Proof of Lemma 1

See Lemma 1.

Proof.

Note that

$$\mathbb{P}(z\mid x_{1:N_t})=\int_{\ell_{1:t}}\mathbb{P}(z,\ell_{1:t}\mid x_{1:N_t})\,d\ell_{1:t}=\int_{\ell_{1:t}}\mathbb{P}(\ell_{1:t}\mid x_{1:N_t})\mathbb{P}(z\mid x_{1:N_t},\ell_{1:t})\,d\ell_{1:t}.$$

For $\mathbb{P}(\ell_{1:t}\mid x_{1:N_t})$, since we have assumed the existence of a function $f$ to decode $\ell_{1:t}$ from $x_{1:N_t}$, it holds that

$$\mathbb{P}(\ell_{1:t}\mid x_{1:N_t})=\prod_{i=1}^{t}\delta\left(\ell_i-f(x_{N_{i-1}+1:N_i})\right),$$

where we use $\delta$ to denote the $d$-dimensional Dirac delta function. For $\mathbb{P}(z\mid x_{1:N_t},\ell_{1:t})$, by Assumption 1, it holds that

$$\mathbb{P}(z,x_{1:N_t},\ell_{1:t})=\mathbb{P}(z,\ell_{1:t})\mathbb{P}(x_{1:N_t}\mid\ell_{1:t}),$$

which leads to $\mathbb{P}(x_{1:N_t}\mid\ell_{1:t})=\mathbb{P}(x_{1:N_t}\mid\ell_{1:t},z)$ by Bayes’ rule. This implies that the random variables $x_{1:N_t}$ and $z$ are independent conditioned on $\ell_{1:t}$. Therefore, it holds that $\mathbb{P}(z\mid x_{1:N_t},\ell_{1:t})=\mathbb{P}(z\mid\ell_{1:t})$. Finally, we can compute

\begin{align*}
\mathbb{P}(z\mid x_{1:N_t})&=\int_{\ell_{1:t}}\mathbb{P}(z,\ell_{1:t}\mid x_{1:N_t})\,d\ell_{1:t}=\int_{\ell_{1:t}}\prod_{i=1}^{t}\delta\left(\ell_i-f(x_{N_{i-1}+1:N_i})\right)\mathbb{P}(z\mid\ell_{1:t})\,d\ell_{1:t}\\
&=\mathbb{P}\left(z\mid\left(\ell_i=f(x_{N_{i-1}+1:N_i})\right)_{i\in[t]}\right).
\end{align*}

Based on this, we conclude that

\begin{align*}
\mathbb{P}(g(x_{N_t+1:N_{t+1}})\mid x_{1:N_t})&=\int_{z}\mathbb{P}(g(x_{N_t+1:N_{t+1}})\mid z,x_{1:N_t})\mathbb{P}(z\mid x_{1:N_t})\,dz\\
&=\int_{z}\mathbb{P}(g(x_{N_t+1:N_{t+1}})\mid z)\,\mathbb{P}\left(z\mid\left(\ell_i=f(x_{N_{i-1}+1:N_i})\right)_{i\in[t]}\right)dz\\
&=\mathbb{P}\left(g(x_{N_t+1:N_{t+1}})\mid\left(\ell_i=f(x_{N_{i-1}+1:N_i})\right)_{i\in[t]}\right),
\end{align*}

where the first equality is by the independence between $x_{N_t+1:N_{t+1}}$ and $x_{1:N_t}$ conditioned on $z$, due to Assumption 1. Therefore, it suffices to analyze $\mathbb{P}(a\mid\ell_{1:t})$ in order to characterize $\mathbb{P}(g(x_{N_t+1:N_{t+1}})\mid x_{1:N_t})$, where we recall the definition $a=g(x_{N_t+1:N_{t+1}})$. Since $z\sim\mathcal{N}(\boldsymbol{0}_d,\sigma^2 I)$ and $\ell_i\mid z\sim\mathcal{N}(z,\sigma^2 I)$, we have

$$z\mid\ell_{1:t}\sim\mathcal{N}\left(\frac{1}{t+1}\sum_{i\in[t]}\ell_i,\ \frac{\sigma^2}{t+1}I\right),\tag{11}$$

by the standard formula for the posterior of a Gaussian mean. Now we conclude that

\begin{align*}
\mathbb{P}(a\mid\ell_{1:t})&=\int_{z}\mathbb{P}(a\mid z,\ell_{1:t})\mathbb{P}(z\mid\ell_{1:t})\,dz=\int_{z}\mathbb{P}(a\mid z)\mathbb{P}(z\mid\ell_{1:t})\,dz\\
&=\int_{z}\mathbbm{1}\left(a\in\operatorname*{arg\,min}_{a'\in\mathcal{A}}z_{a'}\right)\mathbb{P}(z\mid\ell_{1:t})\,dz=\int_{\epsilon}\mathbbm{1}\left(a\in\operatorname*{arg\,min}_{a'\in\mathcal{A}}\left(\frac{\sigma}{\sqrt{t+1}}\epsilon+\frac{1}{t+1}\sum_{i\in[t]}\ell_i\right)_{a'}\right)\mathbb{P}(\epsilon)\,d\epsilon\\
&=\int_{\epsilon}\mathbbm{1}\left(a\in\operatorname*{arg\,min}_{a'\in\mathcal{A}}\left(\sigma\sqrt{t+1}\,\epsilon+\sum_{i\in[t]}\ell_i\right)_{a'}\right)\mathbb{P}(\epsilon)\,d\epsilon=\mathbb{P}\left(a\in\operatorname*{arg\,min}_{a'\in\mathcal{A}}\left(\sigma\sqrt{t+1}\,\epsilon+\sum_{i\in[t]}\ell_i\right)_{a'}\right)\\
&=P_{\text{quantal}}^{\sigma\sqrt{t+1}}(a\mid\ell_{1:t}),
\end{align*}

where $\mathbb{P}(\epsilon)=\mathcal{N}(\boldsymbol{0}_d,I)$. This completes the proof. ∎
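The final identity in the proof can be verified by simulation: drawing $z$ from the posterior in Equation (11) and playing $\arg\min_a z_a$ induces the same action distribution as FTPL with Gaussian perturbation of scale $\sigma\sqrt{t+1}$ applied to the summed losses. A Monte Carlo sketch with a randomly generated loss history (all parameter values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, t, sigma, n = 4, 9, 1.0, 200_000
losses = rng.uniform(size=(t, d))            # illustrative loss history ell_{1:t}
S = losses.sum(axis=0)

# (a) draw z from the Gaussian posterior z | ell_{1:t} and play argmin_a z_a
post_mean, post_std = S / (t + 1), sigma / np.sqrt(t + 1)
z = post_mean + post_std * rng.standard_normal((n, d))
freq_posterior = np.bincount(z.argmin(axis=1), minlength=d) / n

# (b) FTPL with perturbation scale sigma * sqrt(t+1) on the *summed* losses
pert = S + sigma * np.sqrt(t + 1) * rng.standard_normal((n, d))
freq_ftpl = np.bincount(pert.argmin(axis=1), minlength=d) / n

print(np.round(freq_posterior, 3), np.round(freq_ftpl, 3))
assert np.allclose(freq_posterior, freq_ftpl, atol=0.01)
```

The two frequencies agree up to Monte Carlo error since multiplying the posterior argmin argument by the positive constant $t+1$ does not change the minimizer.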

C.3 Relationship between FTPL and Definition 2

Fact 1 .

Performing the generalized quantal response of Definition 2 at every iteration $t\in[T]$ w.r.t. the history of loss vectors $\ell_{1:t-1}$ is essentially executing an FTPL algorithm.

Proof.

Before moving to the proof, we define $Z_{\text{noise}}$ as the random variable with distribution $P_{\text{noise}}$. Note that at round $t\geq 2$ (the policy at round $t=1$ is fixed), we have

$$P_{\text{quantal}}^{\eta_{t-1}}(a\mid\ell_{1:t-1}):=\mathbb{P}\left(a\in\operatorname*{arg\,min}_{a'\in\mathcal{A}}\left(\sum_{i=1}^{t-1}\ell_i+\eta_{t-1}\epsilon\right)(a')\right)\tag{12}$$

which is exactly the case when $\epsilon_t$ in Equation 10 satisfies $\epsilon_t\overset{d}{=}\eta_{t-1}\epsilon$. ∎

C.4 Deferred Proof of Theorem 1

See Theorem 1.

Proof.

Note that

\begin{align*}
\max_{\{q_j\in\{\mathcal{V}^{j-1}\rightarrow\Delta(\mathcal{V})\}\}_{j\in[N_{t+1}]}}\mathbb{E}_{x_{1:N_{t+1}}\sim P^{text}_{t}}\sum_{j=1}^{N_{t+1}}\log q_j\left(x_j\mid x_{1:j-1}\right)&=\max_{q\in\Delta(\mathcal{V}^{N_{t+1}})}\mathbb{E}_{x_{1:N_{t+1}}\sim P^{text}_{t}}\log q(x_{1:N_{t+1}})\\
&=\max_{q\in\Delta(\mathcal{V}^{N_{t+1}})}-\text{KL}(P_t^{text}\,\|\,q)+\mathbb{E}_{x_{1:N_{t+1}}\sim P^{text}_{t}}[\log P_t^{text}(x_{1:N_{t+1}})],
\end{align*}

where $\text{KL}(q\,\|\,p)$ denotes the Kullback–Leibler divergence between two distributions $q$ and $p$. Now we define $\text{LLM}_{\theta}(x_{1:N_{t+1}})=\prod_{j=1}^{N_{t+1}}\text{LLM}_{\theta}(x_j\mid x_{1:j-1})$. It is easy to verify that $\text{LLM}_{\theta}(x_{1:N_{t+1}})\in\Delta(\mathcal{V}^{N_{t+1}})$, i.e., it also defines a valid joint distribution over sequences of tokens. Therefore, we have

\[
\max_{\theta\in\Theta}\mathbb{E}_{x_{1:N_{t+1}}\sim P^{text}_{t}}\sum_{j=1}^{N_{t+1}}\log\text{LLM}_{\theta}\left(x_{j}{\,|\,}x_{1:j-1}\right)=\max_{\theta\in\Theta}\mathbb{E}_{x_{1:N_{t+1}}\sim P^{text}_{t}}\log\text{LLM}_{\theta}(x_{1:N_{t+1}}).
\]

Now, due to our assumption that

\[
\max_{\theta\in\Theta}\mathbb{E}_{x_{1:N_{t+1}}\sim P^{text}_{t}}\sum_{j=1}^{N_{t+1}}\log\text{LLM}_{\theta}\left(x_{j}{\,|\,}x_{1:j-1}\right)=\max_{\left\{q_{j}\in\left\{\mathcal{V}^{j-1}\rightarrow\Delta(\mathcal{V})\right\}\right\}_{j\in[N_{t+1}]}}\mathbb{E}_{x_{1:N_{t+1}}\sim P^{text}_{t}}\sum_{j=1}^{N_{t+1}}\log q_{j}\left(x_{j}{\,|\,}x_{1:j-1}\right),
\]

we conclude that

\[
\min_{\theta\in\Theta}\text{KL}(P_{t}^{text}{\,|\,}{\,|\,}\text{LLM}_{\theta})=\min_{q\in\Delta(\mathcal{V}^{N_{t+1}})}\text{KL}(P_{t}^{text}{\,|\,}{\,|\,}q)=0,
\]

which implies that $\text{LLM}_{\theta^{\star}}=P_{t}^{text}$. Correspondingly, if we define $\text{LLM}_{\theta^{\star}}(x_{N_{t}+1:N_{t+1}}{\,|\,}x_{1:N_{t}})$ to be the conditional distribution induced by the joint distribution $\text{LLM}_{\theta^{\star}}(x_{1:N_{t+1}})$, it holds that

\[
\text{LLM}_{\theta^{\star}}(x_{N_{t}+1:N_{t+1}}{\,|\,}x_{1:N_{t}})=\mathbb{P}(x_{N_{t}+1:N_{t+1}}{\,|\,}x_{1:N_{t}}).
\]

In other words, $\text{LLM}_{\theta^{\star}}$ has, intuitively, learned the corresponding pre-training distribution perfectly. Note that this has been a common assumption in the Bayesian perspective of ICL (Xie et al., 2022; Lee et al., 2023; Zhang et al., 2023b). Therefore, to analyze the actions taken by $\text{LLM}_{\theta^{\star}}$, it suffices to consider $\mathbb{P}(g(x_{N_{t}+1:N_{t+1}}){\,|\,}x_{1:N_{t}})$, which is equal to $P_{quantal}^{\sigma\sqrt{t+1}}\left(g(x_{N_{t}+1:N_{t+1}}){\,\big{|}\,}\left\{f(x_{N_{i-1}+1:N_{i}})\right\}_{i\in[t]}\right)$ by Lemma 1. Therefore, we have proved that $\text{LLM}_{\theta^{\star}}$ essentially mimics the well-known no-regret algorithm FTPL, with perturbation distribution $\mathcal{N}(\boldsymbol{0}_{d},\sigma^{2}tI)$ at round $t\in[T]$, according to Equation 12 of Assumption 1, for which we can establish the corresponding regret guarantee for each case:

(1) Combining the above result with Lemma 2 , we can derive the regret bound for online learning with full-information feedback.

(2) Combining the above result with Lemma 2 and Lemma 4 , we get that

\[
\text{D-Regret}_{\text{LLM}_{\theta^{\star}}}((\ell_{i})_{i\in[T]})\leq\min_{\Delta_{T}\in[T]}\frac{2T}{\Delta_{T}}C\sqrt{\Delta_{T}\log d}+2\Delta_{T}V_{T},
\]

for some constant $C$. We first consider the following problem:

\[
\min_{u>0}\frac{2T}{u}C\sqrt{u\log d}+2uV_{T},
\]

whose optimal solution is $u^{\star}=\left(\frac{C^{2}T^{2}\log d}{4V_{T}^{2}}\right)^{1/3}$. Therefore, if $u^{\star}\in[1,T]$, we can choose $\Delta_{T}=\lceil u^{\star}\rceil$, which results in a regret bound of

\[
\text{D-Regret}_{\text{LLM}_{\theta^{\star}}}((\ell_{i})_{i\in[T]})\leq\frac{2T}{\sqrt{u^{\star}}}C\sqrt{\log d}+4u^{\star}V_{T}=\mathcal{O}\left((\log d\,V_{T})^{1/3}T^{2/3}\right).
\]

Now we check the conditions for $u^{\star}\in[1,T]$. Since $V_{T}\leq T$, $u^{\star}\geq 1$ holds as long as $d$ is sufficiently large. To ensure $u^{\star}\leq T$, we need $V_{T}\geq C\sqrt{\frac{\log d}{4T}}$, which holds as long as $T$ is large enough.
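As a sanity check on the choice of $u^{\star}$ above, the following snippet numerically verifies that $u^{\star}=\left(\frac{C^{2}T^{2}\log d}{4V_{T}^{2}}\right)^{1/3}$ minimizes the objective $\frac{2T}{u}C\sqrt{u\log d}+2uV_{T}$. This is a minimal sketch; the values of $T$, $C$, $d$, and $V_{T}$ are arbitrary illustrative choices, not quantities from the analysis.

```python
import math

def objective(u, T, C, d, V_T):
    # The relaxed objective: 2T/u * C * sqrt(u * log d) + 2 * u * V_T.
    return 2 * T / u * C * math.sqrt(u * math.log(d)) + 2 * u * V_T

# Arbitrary illustrative values (not from the paper).
T, C, d, V_T = 10_000, 1.0, 8, 5.0

# Closed-form minimizer, obtained by setting the derivative to zero:
# d/du [2TC sqrt(log d) u^{-1/2} + 2 V_T u] = 0  =>  u^{3/2} = TC sqrt(log d) / (2 V_T).
u_star = (C**2 * T**2 * math.log(d) / (4 * V_T**2)) ** (1 / 3)

# The objective is convex on u > 0, so u_star should beat nearby points.
for u in (0.5 * u_star, 0.9 * u_star, 1.1 * u_star, 2.0 * u_star):
    assert objective(u_star, T, C, d, V_T) <= objective(u, T, C, d, V_T)
```

Since the objective is convex in $u$, the first-order condition indeed characterizes the global minimizer, matching the closed form used in the proof.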

(3) Combining the above result with Lemma 3 , we can prove a regret guarantee for online learning with bandit feedback.

(4) Combining this result with Lemma 3 and Lemma 4 , it holds that

\[
\mathbb{E}[\text{D-Regret}_{\text{LLM}_{\theta^{\star}}}((\ell_{i})_{i\in[T]})]\leq\min_{\Delta_{T}\in[T]}\frac{2T}{\Delta_{T}}C(\log d)^{\frac{1}{2}}d\Delta_{T}^{\frac{1}{2}+\frac{1}{\log T}}\log\Delta_{T}+2\Delta_{T}V_{T},
\]

for some constant $C$. By an analysis similar to that of (2), we choose $u^{\star}=\left(\frac{C^{\prime}T^{2}d^{2}}{V_{T}^{2}}\right)^{1/3}$ for some constant $C^{\prime}$. If $u^{\star}\in[1,T]$, we choose $\Delta_{T}=\lceil u^{\star}\rceil$ and derive the following regret bound:

\[
\mathbb{E}[\text{D-Regret}_{\text{LLM}_{\theta^{\star}}}((\ell_{i})_{i\in[T]})]\leq\mathcal{O}\left((T^{2}d^{2}V_{T})^{1/3}(\log d)^{1/2}T^{1/\log T}\log T\right).
\]

Now we check the condition $u^{\star}\in[1,T]$. Since $V_{T}\leq T$, $u^{\star}\geq 1$ holds as long as $d$ is sufficiently large. For $u^{\star}\leq T$, we need $V_{T}\geq\sqrt{\frac{C^{\prime}d^{2}}{T}}$, which holds as long as $T$ is large enough.

Now, we present Lemma 2 to Lemma 4. Before proceeding, we assume $\|\ell_{t}\|_{\infty}\leq B=1$ for simplicity of presentation hereafter; the results and proofs are not affected by the constant bound $B$.

Lemma 2 (Regret guarantee of FTPL with full-information feedback).

Suppose the noise distribution of FTPL satisfies $\epsilon_{t}\sim\mathcal{N}(\boldsymbol{0}_{d},\zeta_{t}^{2}I)$ in Equation 10 with $\zeta_{t}=\sigma\sqrt{t}$. Then, for online learning with full-information feedback,

\[
\text{Regret}_{\text{FTPL}}((\ell_{i})_{i\in[T]})\leq 4\left(\sigma+\frac{1}{\sigma}\right)\sqrt{T\log d}=\mathcal{O}(\sqrt{T\log d}).
\]
Proof.

By Theorem 8 of Abernethy et al. ( 2014 ) , we have

\[
\text{Regret}_{\text{FTPL}}((\ell_{i})_{i\in[T]})\leq\sqrt{2\log d}\left(\eta_{T}+\sum_{t=1}^{T}\frac{1}{\eta_{t}}\|\ell_{t}\|_{\infty}^{2}\right),
\]

where $\eta_{t}$ denotes the scale of the perturbation at round $t$. Therefore, plugging $\eta_{t}=\zeta_{t}=\sigma\sqrt{t}$ and $\|\ell_{t}\|_{\infty}^{2}\leq 1$ into the bound yields

\[
\text{Regret}_{\text{FTPL}}((\ell_{i})_{i\in[T]})\leq\sqrt{2\log d}\left(\sigma\sqrt{T}+\sum_{t=1}^{T}\frac{1}{\sigma\sqrt{t}}\right)\leq 4\left(\sigma+\frac{1}{\sigma}\right)\sqrt{T\log d},
\]

completing the proof. ∎
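To make the algorithm in Lemma 2 concrete, here is a minimal simulation sketch of FTPL with time-varying Gaussian perturbation $\zeta_{t}=\sigma\sqrt{t}$ under full-information feedback. The loss sequence, horizon, and dimension are arbitrary illustrative choices; the final assertion checks the bound of Lemma 2 on one sample path with i.i.d. losses, rather than in expectation, so it is a sanity check and not a proof.

```python
import math
import numpy as np

def ftpl_regret(losses, sigma=1.0, seed=0):
    """Run FTPL with perturbation N(0, (sigma*sqrt(t))^2 * I) at round t
    (full-information feedback) and return the realized static regret."""
    rng = np.random.default_rng(seed)
    T, d = losses.shape
    cum_loss = np.zeros(d)   # cumulative loss vector observed so far
    total = 0.0
    for t in range(1, T + 1):
        zeta_t = sigma * math.sqrt(t)
        # The Gaussian noise is symmetric, so "follow the perturbed leader"
        # is equivalent to minimizing the perturbed cumulative loss.
        noise = zeta_t * rng.standard_normal(d)
        action = int(np.argmin(cum_loss + noise))
        total += losses[t - 1, action]
        cum_loss += losses[t - 1]        # full-information update
    best_fixed = cum_loss.min()          # best single action in hindsight
    return total - best_fixed

rng = np.random.default_rng(42)
T, d, sigma = 500, 5, 1.0
losses = rng.uniform(0.0, 1.0, size=(T, d))   # ensures ||l_t||_inf <= 1
regret = ftpl_regret(losses, sigma=sigma)
bound = 4 * (sigma + 1 / sigma) * math.sqrt(T * math.log(d))
assert regret <= bound
```

With i.i.d. uniform losses the realized regret is typically far below the worst-case bound; adversarial sequences would be needed to stress it.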

Lemma 3 (Regret guarantee of FTPL with bandit feedback).

Suppose the noise distribution of FTPL satisfies $\epsilon_{t}\sim\mathcal{N}(\boldsymbol{0}_{d},\zeta_{t}^{2}I)$ in Equation 10 with $\zeta_{t}=\sigma\sqrt{t}$. Then, for online learning with bandit feedback,

\[
\mathbb{E}[\text{Regret}_{\text{FTPL}}((\ell_{i})_{i\in[T]})]\leq\mathcal{O}\left((\log d)^{\frac{1}{2}}dT^{\frac{1}{2}+\frac{1}{\log T}}\log T\right).
\]
Proof.

The proof for the bandit case is more involved. We first introduce the following notation: $G_{t}=\sum_{t^{\prime}=1}^{t}-\ell_{t^{\prime}}$, $\widehat{G}_{t}=\sum_{t^{\prime}=1}^{t}-\widehat{\ell}_{t^{\prime}}$, $\Phi(G)=\max_{\pi}\langle\pi,G\rangle$, $\Phi_{t}(G)=\mathbb{E}_{\epsilon\sim\mathcal{N}(\boldsymbol{0}_{d},I)}\Phi(G+\zeta_{t}\epsilon)$, and $D_{\Phi_{t}}$ denotes the Bregman divergence with respect to $\Phi_{t}$, where we recall the construction of the empirical estimator $\widehat{\ell}_{t^{\prime}}$ of $\ell_{t^{\prime}}$ in Section 3.2. By Li and Tewari (2017), $\pi_{t}=\nabla\Phi_{t}(\widehat{G}_{t})$. Now, by the convexity of $\Phi$ and Jensen's inequality,

\[
\Phi(G_{T})=\Phi(\mathbb{E}[\widehat{G}_{T}])\leq\mathbb{E}[\Phi(\widehat{G}_{T})].
\]

Therefore,

\[
\mathbb{E}[\text{Regret}_{\text{FTPL}}((\ell_{i})_{i\in[T]})]=\Phi(G_{T})-\mathbb{E}\left[\sum_{t=1}^{T}\langle\pi_{t},-\widehat{\ell}_{t}\rangle\right]\leq\mathbb{E}\left[\Phi(\widehat{G}_{T})-\sum_{t=1}^{T}\langle\pi_{t},-\widehat{\ell}_{t}\rangle\right].
\]

By recalling the definition of the Bregman divergence, we have

\begin{align*}
-\sum_{t=1}^{T}\langle\pi_{t},-\widehat{\ell}_{t}\rangle&=-\sum_{t=1}^{T}\langle\nabla\Phi_{t}(\widehat{G}_{t}),-\widehat{\ell}_{t}\rangle=-\sum_{t=1}^{T}\langle\nabla\Phi_{t}(\widehat{G}_{t}),\widehat{G}_{t}-\widehat{G}_{t-1}\rangle\\
&=\sum_{t=1}^{T}D_{\Phi_{t}}(\widehat{G}_{t},\widehat{G}_{t-1})+\Phi_{t}(\widehat{G}_{t-1})-\Phi_{t}(\widehat{G}_{t}).
\end{align*}

Therefore,

\begin{align*}
&\mathbb{E}\left[\text{Regret}_{\text{FTPL}}((\ell_{i})_{i\in[T]})\right]\\
&\leq\underbrace{\mathbb{E}\left[\sum_{t=1}^{T}D_{\Phi_{t}}(\widehat{G}_{t},\widehat{G}_{t-1})\right]}_{(i)}+\underbrace{\mathbb{E}\left[\sum_{t=1}^{T}\Phi_{t}(\widehat{G}_{t-1})-\Phi_{t-1}(\widehat{G}_{t-1})\right]}_{(ii)}+\underbrace{\mathbb{E}\left[\Phi(\widehat{G}_{T})-\Phi_{T}(\widehat{G}_{T})\right]}_{(iii)}.
\end{align*}

Term $(iii)\leq 0$ due to the convexity of $\Phi$. For $(ii)$, we use Lemma 10 of Abernethy et al. (2014) to obtain

\[
\mathbb{E}\left[\sum_{t=1}^{T}\Phi_{t}(\widehat{G}_{t-1})-\Phi_{t-1}(\widehat{G}_{t-1})\right]\leq\zeta_{T}\mathbb{E}_{\epsilon}[\Phi(\epsilon)]\leq\mathcal{O}(\sqrt{2T\log d}).
\]

For $(i)$, by Theorem 8 of Li and Tewari (2017), for any $\alpha\in(0,1)$ the following holds:

\[
\mathbb{E}\left[\sum_{t=1}^{T}D_{\Phi_{t}}(\widehat{G}_{t},\widehat{G}_{t-1})\right]\leq\sum_{t=1}^{T}\zeta_{t}^{\alpha-1}\frac{4d}{\alpha(1-\alpha)}\leq\frac{4d}{\alpha(1-\alpha)}\mathcal{O}(T^{\frac{1+\alpha}{2}}).
\]

Tuning $\alpha=\frac{2}{\log T}$ then yields $\mathbb{E}[\text{Regret}_{\text{FTPL}}((\ell_{i})_{i\in[T]})]\leq\mathcal{O}\left((\log d)^{\frac{1}{2}}dT^{\frac{1}{2}+\frac{1}{\log T}}\log T\right)$. ∎

Lemma 4.

Denote the variation of the loss vectors as $L_{T}=\sum_{t=1}^{T-1}\|\ell_{t+1}-\ell_{t}\|_{\infty}$. Suppose there exists an algorithm $\mathscr{A}$ for online learning with full-information feedback with regret guarantee $\text{Regret}_{\mathscr{A}}((\ell_{i})_{i\in[T]})\leq f(T,d)$ for some function $f$, where $T$ denotes the horizon and $d$ denotes the policy dimension. Then, there exists another algorithm $\mathscr{A}^{\prime}$ that can achieve

\[
\text{D-Regret}_{\mathscr{A}^{\prime}}((\ell_{i})_{i\in[T]})\leq\min_{\Delta_{T}\in[T]}\left(\frac{T}{\Delta_{T}}+1\right)f(\Delta_{T},d)+2\Delta_{T}L_{T}.
\]

Similarly, suppose there exists an algorithm $\mathscr{B}$ for online learning with bandit feedback with regret guarantee $\mathbb{E}\left[\text{Regret}_{\mathscr{B}}((\ell_{i})_{i\in[T]})\right]\leq g(T,d)$ for some function $g$. Then, there exists another algorithm $\mathscr{B}^{\prime}$ that can achieve

\[
\mathbb{E}[\text{D-Regret}_{\mathscr{B}^{\prime}}((\ell_{i})_{i\in[T]})]\leq\min_{\Delta_{T}\in[T]}\left(\frac{T}{\Delta_{T}}+1\right)g(\Delta_{T},d)+2\Delta_{T}L_{T}.
\]
Proof.

We define $\mathscr{A}^{\prime}$ as the algorithm that restarts $\mathscr{A}$ every $\Delta_{T}$ iterations. We break the time indices $[T]$ into $m$ batches $\mathcal{T}_{1:m}$, each of size $\Delta_{T}$ (except possibly the last batch). Denote $\ell_{i}^{\star}:=\min_{j\in[d]}\ell_{ij}$. By Equation (6) of Besbes et al. (2014), it holds for each $k\in[m]$ that

\[
\min_{j\in[d]}\left(\sum_{t\in\mathcal{T}_{k}}\ell_{t}\right)_{j}-\sum_{t\in\mathcal{T}_{k}}\ell^{\star}_{t}\leq 2\Delta_{T}L_{k},
\]

where we define $L_{k}=\sum_{t\in\mathcal{T}_{k}}\|\ell_{t+1}-\ell_{t}\|_{\infty}$. Therefore, we have

\begin{align}
\text{D-Regret}_{\mathscr{A}^{\prime}}((\ell_{i})_{i\in[T]})&\leq\min_{j\in[d]}\left(\sum_{t\in[T]}\ell_{t}\right)_{j}-\sum_{t\in[T]}\ell^{\star}_{t}+\sum_{k\in[m]}\text{Regret}_{\mathscr{A}}((\ell_{i})_{i\in\mathcal{T}_{k}})\tag{13}\\
&\leq 2\Delta_{T}\Big(\sum_{k\in[m]}L_{k}\Big)+(T/\Delta_{T}+1)f(\Delta_{T},d).\nonumber
\end{align}

By Equation (4) of Besbes et al. (2014), $\sum_{k\in[m]}L_{k}\leq L_{T}$; since this inequality holds for any $\Delta_{T}\in[T]$, we have proved that $\text{D-Regret}_{\mathscr{A}^{\prime}}((\ell_{i})_{i\in[T]})\leq\min_{\Delta_{T}\in[T]}\left(\frac{T}{\Delta_{T}}+1\right)f(\Delta_{T},d)+2\Delta_{T}L_{T}$.

Similarly, taking the expectation of Equation 13, it holds that

\begin{align*}
\mathbb{E}[\text{D-Regret}_{\mathscr{B}^{\prime}}((\ell_{i})_{i\in[T]})]&\leq\min_{j\in[d]}\left(\sum_{t\in[T]}\ell_{t}\right)_{j}-\sum_{t\in[T]}\ell^{\star}_{t}+\sum_{k\in[m]}\mathbb{E}[\text{Regret}_{\mathscr{B}}((\ell_{i})_{i\in\mathcal{T}_{k}})]\\
&\leq\min_{\Delta_{T}\in[T]}\left(\frac{T}{\Delta_{T}}+1\right)g(\Delta_{T},d)+2\Delta_{T}L_{T},
\end{align*}

thus completing the proof. ∎
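The restarting construction in the proof of Lemma 4 can be sketched as a generic meta-algorithm: given a base algorithm with a static-regret guarantee, restart it every $\Delta_{T}$ rounds. The sketch below is illustrative only; it uses simple follow-the-leader as a stand-in for the base algorithm $\mathscr{A}$ (the lemma applies to any base algorithm), and the data are arbitrary.

```python
import numpy as np

def restart_wrapper(losses, delta_T):
    """Meta-algorithm A': restart the base algorithm every delta_T rounds.
    Base algorithm (stand-in for illustration): follow-the-leader, i.e.,
    play the action with the smallest cumulative loss in the current batch."""
    T, d = losses.shape
    total = 0.0
    for start in range(0, T, delta_T):       # batches T_1, ..., T_m
        batch = losses[start:start + delta_T]
        cum = np.zeros(d)                    # restart: reset the state
        for loss in batch:
            action = int(np.argmin(cum))     # FTL within this batch only
            total += loss[action]
            cum += loss
    # Dynamic-regret comparator: the per-round best action.
    dynamic_best = losses.min(axis=1).sum()
    return total - dynamic_best

rng = np.random.default_rng(0)
losses = rng.uniform(size=(100, 4))
d_regret = restart_wrapper(losses, delta_T=10)
assert d_regret >= 0.0   # realized loss can never beat the per-round best
```

The batching mirrors the proof: within each batch the base algorithm controls the static regret, and the $2\Delta_{T}L_{k}$ terms account for the drift of the per-round minimizer inside a batch.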

Combining the results above completes the proof for Theorem 1 . ∎

C.5 Extending Theorem 1 with Relaxed Assumptions

C.5.1 Relaxation under More General Data Distributions

We first remark on the possibility of relaxing the Gaussian assumptions on the data distributions.

Remark 3 (Relaxing the Gaussian distribution assumption).

In the proof of Lemma 1, to obtain the result that the action is a quantal response w.r.t. $\ell_{1:T}$, one does not necessarily require both the prior distribution of $z$ and the conditional distribution of $\ell_{i}$ given $z$ to be Gaussian. Instead, any joint distribution $\mathbb{P}(z,\ell_{1:T})$ whose posterior satisfies Equation 11 suffices; the condition is a combined effect of both the prior and the conditional distributions.

More formally, we can extend Theorem 1 to the case with a much more general prior task distribution than the Gaussian one, where the key observation is that Equation 11 only needs to hold approximately.

Theorem 5.

In Theorem 1, the assumption on $\mathbb{P}(z)$ can be relaxed to only require the coordinates of $z$ to be i.i.d. with $0<\mathbb{P}(z_{j})<\infty$ and $|\nabla\mathbb{P}(z_{j})|<\infty$ for any $j\in[d]$ and $z_{j}\in\mathbb{R}$; the bounds in (1) and (2) of Theorem 1 still hold, with only a degradation of $\mathcal{O}(d^{2}\log T)$.

The key idea of the proof is that when $t$ is large enough, the prior distribution barely affects the posterior distribution, a fact also known as the Bernstein–von Mises theorem (Van der Vaart, 2000).

Proof.

Since we extend Theorem 1 to settings with a general task prior distribution that only requires the coordinates to be i.i.d., from now on we consider the $j$-th coordinate only. To begin, fix $t\in[T]$ and define the log-likelihood as

\[
L_{t}(z_{j}):=\log\prod_{i=1}^{t}\frac{1}{\sigma(2\pi)^{1/2}}e^{-\frac{1}{2\sigma^{2}}(\ell_{ij}-z_{j})^{2}}=-t\log\sigma-\frac{t}{2}\log 2\pi-\sum_{i=1}^{t}\frac{1}{2\sigma^{2}}(\ell_{ij}-z_{j})^{2}.
\]

Then, the MLE estimator z ^ j , t subscript ^ 𝑧 𝑗 𝑡 \widehat{z}_{j,t} is defined as

\[
\widehat{z}_{j,t}:=\arg\max_{z_{j}\in\mathbb{R}}L_{t}(z_{j})=\frac{1}{t}\sum_{i=1}^{t}\ell_{ij}.
\]

We also define $\widehat{J}_{t}:\mathbb{R}\to\mathbb{R}$ as

\[
\widehat{J}_{t}(z_{j}):=-\frac{\nabla^{2}L_{t}(z_{j})}{t}=\frac{1}{\sigma^{2}}.
\]

For Assumption 1 of Kasprzak et al. (2022) to hold, any $\delta>0$ and $M_{2}>0$ suffice.

For Assumption 2 of Kasprzak et al. (2022) to hold, we can choose $\widehat{M}_{1}=\max_{z_{j}\in[-\delta,1+\delta]}\frac{1}{\mathbb{P}(z_{j})}$.

For Assumption 7 of Kasprzak et al. (2022) to hold, we choose $\delta=\sigma$.

For Assumption 8 of Kasprzak et al. (2022) to hold, one can choose $M_{2}=\frac{\sigma}{2}$.

For Assumption 9 of Kasprzak et al. (2022) to hold, we have

\[
\kappa\leq-\sup_{(z_{j}-\widehat{z}_{j,t})^{2}\geq\delta}\frac{L_{t}(z_{j})-L_{t}(\widehat{z}_{j,t})}{t}=-\frac{1}{2\sigma^{2}t}\sup_{(z_{j}-\widehat{z}_{j,t})^{2}\geq\delta}\sum_{i=1}^{t}\left((\ell_{ij}-\widehat{z}_{j,t})^{2}-(\ell_{ij}-z_{j})^{2}\right)=\frac{1}{4\sigma}.
\]

For Assumption 10 of Kasprzak et al. (2022) to hold, we choose $M_{1}=\sup_{z_{j}\in[-\delta,1+\delta]}\left|\frac{\nabla\mathbb{P}(z_{j})}{\mathbb{P}(z_{j})}\right|$ and $\widetilde{M}_{1}=\sup_{z_{j}\in[-\delta,1+\delta]}|\mathbb{P}(z_{j})|$, since we have assumed that $0<\mathbb{P}(z_{j})<\infty$ and $|\nabla\mathbb{P}(z_{j})|<\infty$.

By Theorem 6.1 of Kasprzak et al. (2022), we have

z j | \displaystyle\int_{z_{j}}| ( z j / t + z ^ j | ( i j ) i [ t ] ) C e 1 2 σ 2 z j 2 | d z j subscript 𝑧 𝑗 𝑡 conditional subscript ^ 𝑧 𝑗 subscript subscript 𝑖 𝑗 𝑖 delimited-[] 𝑡 conditional 𝐶 superscript 𝑒 1 2 superscript 𝜎 2 superscript subscript 𝑧 𝑗 2 𝑑 subscript 𝑧 𝑗 \displaystyle\mathbb{P}(z_{j}/\sqrt{t}+\widehat{z}_{j}{\,|\,}(\ell_{ij})_{i\in[t]})-Ce^{-\frac{1}{2\sigma^{2}}z_{j}^{2}}|dz_{j}
= t z j | ( z j | ( i j ) i [ t ] ) 𝒩 ( z ^ j , σ 2 t ) | d z j D 1 t 1 / 2 + D 2 t 1 / 2 e t κ + 2 𝒟 ^ ( t , δ ) , \displaystyle=\sqrt{t}\int_{z_{j}}|\mathbb{P}(z_{j}{\,|\,}(\ell_{ij})_{i\in[t]})-\mathcal{N}(\widehat{z}_{j},\frac{\sigma^{2}}{t})|dz_{j}\leq D_{1}t^{-1/2}+D_{2}t^{1/2}e^{-t\kappa}+2\widehat{\mathcal{D}}(t,\delta),

where $C$ is the normalization constant and

\begin{align*}
D_{1}&=\frac{\sqrt{\widetilde{M}_{1}\widehat{M}_{1}}}{\sigma}\left(\frac{\sqrt{3}\sigma^{2}}{2\big(1-\sqrt{\widehat{\mathcal{D}}(t,\delta)}\big)}M_{2}+M_{1}\right),\\
D_{2}&=\frac{2\widehat{M}_{1}\widehat{J}_{t}^{p}(\widehat{z}_{j},\delta)}{(2\pi)^{1/2}(1-\widehat{\mathcal{D}}^{p}(t,\delta))},\\
\widehat{\mathcal{D}}(t,\delta)&=e^{-\frac{1}{2}(\sqrt{t}-1)^{2}},\\
\widehat{J}_{t}^{p}(\widehat{z}_{j},\delta)&=\frac{1}{\sigma^{2}}+\frac{\delta M_{2}}{3}.
\end{align*}

Therefore, we conclude that the TV distance between the distribution of $z$ conditioned on $(\ell_{i})_{i\in[t]}$ and $\mathcal{N}\left(\widehat{z},\frac{\sigma^{2}}{t}\right)$ satisfies

\[
\int_{z}\Big|\mathbb{P}(z{\,|\,}(\ell_{i})_{i\in[t]})-\mathcal{N}\left(\widehat{z},\frac{\sigma^{2}}{t}\right)\Big|dz\leq\sum_{j=1}^{d}\int_{z_{j}}\Big|\mathbb{P}(z_{j}{\,|\,}(\ell_{ij})_{i\in[t]})-\mathcal{N}\left(\widehat{z}_{j},\frac{\sigma^{2}}{t}\right)\Big|dz_{j}\leq\mathcal{O}(d/t),
\]

due to the independence of $(z_{j})_{j\in[d]}$ conditioned on $\ell_{1:t}$. Now, denote by $\widehat{\text{FTPL}}$ the FTPL algorithm w.r.t. the noise distribution $\mathbb{P}(z{\,|\,}(\ell_{i})_{i\in[t]})$, and by FTPL the algorithm w.r.t. the noise distribution $\mathcal{N}(\widehat{z},\frac{\sigma^{2}}{t})$. Therefore, we have

\begin{align*}
\big|\text{Regret}_{\text{FTPL}}((\ell_{i})_{i\in[T]})-\text{Regret}_{\widehat{\text{FTPL}}}((\ell_{i})_{i\in[T]})\big|&\leq\sum_{t=1}^{T}d\|\pi_{t}-\widehat{\pi}_{t}\|_{\infty}\\
&\leq d\sum_{t=1}^{T}\int_{z}\Big|\mathbb{P}(z{\,|\,}(\ell_{i})_{i\in[t]})-\mathcal{N}\Big(\widehat{z},\frac{\sigma^{2}}{t}\Big)\Big|dz=\mathcal{O}(d^{2}\log T).
\end{align*}

In other words, using $\mathbb{P}(z{\,|\,}(\ell_{i})_{i\in[t]})$ as the noise distribution only increases the regret by $\mathcal{O}(d^{2}\log T)$. Similarly, it is easy to see that

\[
\big|\text{D-Regret}_{\text{FTPL}}((\ell_{i})_{i\in[T]})-\text{D-Regret}_{\widehat{\text{FTPL}}}((\ell_{i})_{i\in[T]})\big|\leq\mathcal{O}(d^{2}\log T),
\]

which completes the proof. ∎
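The first inequality above rests on a coupling fact: if two FTPL variants differ only in their noise laws, their per-round action distributions differ by at most the total variation distance between those laws. The sketch below illustrates this fact numerically; the $d=2$ setup, the losses, and the mean shift `mu` are our own illustrative choices, not values from the proof.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

# Two FTPL variants on d = 2 actions, differing only in the noise law of the
# first coordinate: N(0, 1) vs. N(mu, 1). Their per-round action distributions
# can then differ by at most TV(N(0,1), N(mu,1)).
mu = 0.3
L = np.array([0.2, 0.5])                        # cumulative losses so far (illustrative)
M = 400000
z0 = rng.normal(0.0, 1.0, size=(M, 2))          # noise draws for FTPL
z1 = z0.copy()
z1[:, 0] = rng.normal(mu, 1.0, size=M)          # perturbed noise law for FTPL-hat
pi0 = np.mean(np.argmin(L - z0, axis=1) == 0)   # FTPL: perturb losses, take the minimizer
pi1 = np.mean(np.argmin(L - z1, axis=1) == 0)
tv = erf(abs(mu) / (2 * sqrt(2)))               # TV(N(0,1), N(mu,1)) in closed form
assert abs(pi0 - pi1) <= tv                     # policy gap bounded by the noise TV gap
```

Summing such per-round policy gaps, scaled by the loss range, is exactly how the displayed regret-difference bound arises.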

C.5.2 Relaxation under Decision-Irrelevant Pre-Training Data

We then remark on the possible relaxation when the training data may not all come from decision-making tasks.

Remark 4 (Pre-training with relaxed data assumptions) .

Note that the pre-training (text) data are so far assumed to be related to decision-making problems (though not necessarily sequential ones); see Assumption 1 and Example 1 for instance. This setting can also be generalized to text datasets involving Question-Answering (Q-A), a typical task in natural language processing, where the true/factual answer and the sampled answers from different human users (with possibly wrong or biased answers) correspond to the latent $z$ (with associated maximizer $a$) and $\ell_{1:t}$, respectively. Moreover, in practice, the pre-training data may also involve non-decision-making/Q-A texts, given the diversity of the datasets. For such scenarios, we will make the assumptions on the data distribution conditioned on the prompt for decision-making. Specifically, when interacting with the LLM, human users will provide prompts (see, e.g., our Figure 3) to induce it to make decisions. This will query the conditional distribution

$$\mathbb{P}\left(g(x_{N_{t}+1:N_{t+1}})\,\big|\,x_{1:N_{t}},\text{decision-making prompt}\right)$$

to generate the control action. Correspondingly, Assumption 1 will thus only need to be made on

$$\mathbb{P}\left(z,\ell_{1:t},x_{1:N_{t+1}},\text{decision-making prompt}\right),$$

while we do not need to make such assumptions on other prompts, e.g., corpora that are not related to decision-making.

Appendix D Deferred Results and Proofs in Section 5

D.1 Basic Lemmas

Lemma 5 (Double iterated limit) .

For a sequence $(a_{mn})_{m,n\in\mathbb{N}^{+}}$, suppose that $\lim_{m,n\to\infty}a_{mn}=L$. Then the following are equivalent:

  • For each $m$, $\lim_{n\to\infty}a_{mn}$ exists;

  • $\lim_{m\to\infty}\lim_{n\to\infty}a_{mn}=L$.
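As a concrete instance of Lemma 5 (with illustrative values of our own), take $a_{mn}=1/m+1/n$: the double limit is $L=0$, each inner limit $\lim_{n\to\infty}a_{mn}=1/m$ exists, and the iterated limit is again $0$.

```python
# Illustrative instance of Lemma 5: a_{mn} = 1/m + 1/n.
def a(m, n):
    return 1.0 / m + 1.0 / n

# For each fixed m, the inner limit lim_n a_{mn} = 1/m exists (approximated at large n).
inner = [a(m, 10**8) for m in (1, 10, 100, 10**6)]
# Taking m -> infinity afterwards recovers the double limit L = 0.
assert abs(inner[-1]) < 1e-5
# The double limit along the diagonal also tends to 0.
assert abs(a(10**6, 10**6)) < 1e-5
```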

Lemma 6 (Hoeffding’s inequality) .

Let $X_{1},X_{2},\ldots,X_{n}$ be independent random variables bounded by the intervals $[a_{i},b_{i}]$, respectively. Define $\bar{X}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$ and let $\mu=\mathbb{E}[\bar{X}]$ be the expected value of $\bar{X}$. Then, for any $t>0$,

$$\mathbb{P}(|\bar{X}-\mu|\geq t)\leq 2\exp\left(-\frac{2n^{2}t^{2}}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}\right).$$
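The bound can be sanity-checked by Monte Carlo; the Uniform$[0,1]$ setup below is our own illustration.

```python
import numpy as np

def hoeffding_bound(n, t, a, b):
    # Two-sided Hoeffding bound for the mean of n i.i.d. variables, each in [a, b].
    return 2.0 * np.exp(-2.0 * n**2 * t**2 / (n * (b - a) ** 2))

rng = np.random.default_rng(0)
n, t = 100, 0.1
trials = 20000
# Uniform[0, 1] samples: mean 0.5, each bounded in [0, 1].
means = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
empirical_tail = np.mean(np.abs(means - 0.5) >= t)
bound = hoeffding_bound(n, t, 0.0, 1.0)
assert empirical_tail <= bound   # bound = 2 exp(-2) ~ 0.27; the empirical tail is far smaller
```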
Lemma 7 (Uniform convergence $\Longrightarrow$ interchanging limit and infimum).

If $(f_{n}:X\to\mathbb{R})_{n\in\mathbb{N}^{+}}$ is a sequence of continuous functions that converges uniformly to a function $f:X\to\mathbb{R}$ on the domain $X$, then $\lim_{n\to\infty}\inf_{x\in X}f_{n}(x)=\inf_{x\in X}f(x)$ holds.
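A minimal numerical instance of Lemma 7 (our own illustrative functions): $f_{n}(x)=x^{2}+1/n$ converges uniformly to $f(x)=x^{2}$ on $X=[-1,1]$, so $\lim_{n}\inf_{x}f_{n}(x)=\lim_{n}1/n=0=\inf_{x}f(x)$.

```python
import numpy as np

# f_n(x) = x^2 + 1/n converges uniformly to f(x) = x^2 on [-1, 1],
# so the limit and the infimum can be interchanged (Lemma 7).
xs = np.linspace(-1.0, 1.0, 10001)
inf_fn = [np.min(xs**2 + 1.0 / n) for n in (1, 10, 100, 10**5)]
assert inf_fn[-1] < 1e-4   # inf_x f_n(x) = 1/n -> 0 = inf_x f(x)
# Uniform convergence matters: g_n(x) = (x - n)^2 on all of R has inf_x g_n = 0
# for every n, yet g_n(x) -> infinity pointwise at each fixed x.
```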

D.2 Deferred Proof for the Arguments in Section 5.1

In this section, we prove some properties of $\mathcal{L}(\theta,k,N)$ under certain regularity conditions on $f,h$. Throughout this subsection, we will assume the following condition holds.

Condition 1 .

For $h:\mathbb{R}\to\mathbb{R}^{+}$ and $f:\mathbb{R}\times\mathbb{N}^{+}\to\mathbb{R}^{+}$, suppose $h(\cdot)$ and $f(\cdot,k)$ are both continuous and non-decreasing for any $k\in\mathbb{N}^{+}$. The derivative $h':\mathbb{R}\to\mathbb{R}$ is also continuous. Moreover, $f$ satisfies $\log f(R_{1},k_{1})-\log f(R_{1},k_{2})\geq\log f(R_{2},k_{1})-\log f(R_{2},k_{2})$ for $R_{1}\geq R_{2}$ and $k_{1}\geq k_{2}$, i.e., $\log f$ is supermodular, as well as $\lim_{k\to\infty}\frac{f(R_{1},k)}{f(R_{2},k)}=\infty\cdot\mathbbm{1}(R_{1}>R_{2})+\mathbbm{1}(R_{1}=R_{2})$, with the convention $\infty\cdot 0=0$. Lastly, $(\ell_{t}^{(j)})_{t\in[T],j\in[N]}$ are continuous random variables supported on $[-B,B]^{T\times N}$.
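One concrete choice satisfying these conditions on $f$ is $f(R,k)=\exp(kR)$ (an illustration of ours; any $f$ meeting Condition 1 works): $\log f(R,k)=kR$ is supermodular in $(R,k)$, and the ratio $f(R_{1},k)/f(R_{2},k)=e^{k(R_{1}-R_{2})}$ diverges iff $R_{1}>R_{2}$ and equals $1$ iff $R_{1}=R_{2}$.

```python
import math

# Candidate f(R, k) = exp(k R) for Condition 1.
def f(R, k):
    return math.exp(k * R)

R1, R2, k1, k2 = 2.0, 1.0, 5, 3
lhs = math.log(f(R1, k1)) - math.log(f(R1, k2))
rhs = math.log(f(R2, k1)) - math.log(f(R2, k2))
assert lhs >= rhs                        # supermodularity of log f
assert f(R1, 50) / f(R2, 50) > 1e20      # ratio diverges for R1 > R2 as k grows
assert f(R1, 50) / f(R1, 50) == 1.0      # ratio is identically 1 for equal arguments
```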

Claim 1 (Iterated limits of $\mathcal{L}(\theta,k,N)$ coincide with the double limit of $\mathcal{L}(\theta,k,N)$).

It holds that:

$$\lim_{N\to\infty}\lim_{k\to\infty}\mathcal{L}(\theta,k,N)=\lim_{N,k\to\infty}\mathcal{L}(\theta,k,N)=\lim_{k\to\infty}\lim_{N\to\infty}\mathcal{L}(\theta,k,N)=h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right).$$
Proof.

Step 1. Proving $\lim_{N\to\infty}\lim_{k\to\infty}\mathcal{L}(\theta,k,N)=h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right)$.

Firstly, since both $h$ and $f$ are non-negative (Condition 1), and $\mathbb{E}_{(\ell_{t}^{(j)})_{t\in[T],j\in[N]}}\left[h\big(\max_{j\in[N]}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]})\big)\right]$ exists, we have by the dominated convergence theorem that

$$\lim_{k\to\infty}\mathcal{L}(\theta,k,N)=\mathbb{E}\lim_{k\to\infty}\left[\frac{\sum_{j\in[N]}h\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]})\big)\,f\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]}),k\big)}{\sum_{j\in[N]}f\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]}),k\big)}\right]=\mathbb{E}_{(\ell_{t}^{(j)})_{t\in[T],j\in[N]}}\left[h\big(\max_{j\in[N]}R_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]})\big)\right],$$

where $R_{\text{LLM}_{\theta}}$ is an abbreviation of $\text{Regret}_{\text{LLM}_{\theta}}$. By (Ahsanullah et al., 2013, Chapter 11), we have $h\big(\max_{j\in[N]}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]})\big)\overset{p}{\to}h\big(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\big)$ as $N\to\infty$. Hence, $\lim_{N\to\infty}\lim_{k\to\infty}\mathcal{L}(\theta,k,N)=h\big(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\big)$ holds.

Step 2. Proving $\lim_{N,k\to\infty}\mathcal{L}(\theta,k,N)=h\big(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\big)$.

Now, we calculate $\lim_{N,k\to\infty}\mathcal{L}(\theta,k,N)$.

Lemma 8 .

For any $0<\epsilon<1$, it follows that

$$\lim_{N,k\to\infty}\frac{\sum_{i=1}^{N}f(X_{i},k)H(X_{i})\mathbbm{1}(H(X_{i})<1-\epsilon)}{\sum_{i=1}^{N}f(X_{i},k)H(X_{i})\mathbbm{1}(H(X_{i})>1-\epsilon/2)}=0$$

and

$$\lim_{N,k\to\infty}\frac{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})<1-\epsilon)}{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})>1-\epsilon/2)}=0$$

hold with probability $1$, where the $X_{i}$'s are i.i.d. random variables with $\operatorname{esssup}(H(X_{i}))=1$, and $H:\mathbb{R}\to\mathbb{R}^{+}$ is a continuous non-decreasing function.

Proof of Lemma 8 .

Since $f(\cdot,k)$ and $H$ are non-negative and non-decreasing, we have

$$\frac{\sum_{i=1}^{N}f(X_{i},k)H(X_{i})\mathbbm{1}(H(X_{i})<1-\epsilon)}{\sum_{i=1}^{N}f(X_{i},k)H(X_{i})\mathbbm{1}(H(X_{i})>1-\epsilon/2)}\leq\frac{(1-\epsilon)f(H^{-1}(1-\epsilon),k)\,|\{i\in[N]\mid H(X_{i})<1-\epsilon\}|}{(1-\epsilon/2)f(H^{-1}(1-\epsilon/2),k)\,|\{i\in[N]\mid H(X_{i})>1-\epsilon/2\}|},$$

and we know that

$$\frac{|\{i\in[N]\mid H(X_{i})<1-\epsilon\}|}{|\{i\in[N]\mid H(X_{i})>1-\epsilon/2\}|}\overset{a.s.}{\to}\frac{F(1-\epsilon)}{1-F(1-\epsilon/2)}$$

as $N\to\infty$, where $F$ is the cumulative distribution function of the random variable $H(X)$. Therefore, we have

$$0\leq\lim_{N,k\to\infty}\frac{\sum_{i=1}^{N}f(X_{i},k)H(X_{i})\mathbbm{1}(H(X_{i})<1-\epsilon)}{\sum_{i=1}^{N}f(X_{i},k)H(X_{i})\mathbbm{1}(H(X_{i})>1-\epsilon/2)}\leq\lim_{N,k\to\infty}\frac{(1-\epsilon)f(H^{-1}(1-\epsilon),k)\,|\{i\in[N]\mid H(X_{i})<1-\epsilon\}|}{(1-\epsilon/2)f(H^{-1}(1-\epsilon/2),k)\,|\{i\in[N]\mid H(X_{i})>1-\epsilon/2\}|}\overset{a.s.}{\leq}\lim_{N,k\to\infty}\frac{(1-\epsilon)f(H^{-1}(1-\epsilon),k)}{(1-\epsilon/2)f(H^{-1}(1-\epsilon/2),k)}\cdot\frac{F(1-\epsilon)}{1-F(1-\epsilon/2)}=0.$$

By a similar argument, we have

$$\lim_{N,k\to\infty}\frac{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})<1-\epsilon)}{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})>1-\epsilon/2)}=0$$

with probability 1. ∎
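Lemma 8 can also be sanity-checked numerically; the concrete choices below ($X\sim\text{Uniform}[0,1]$, $H(x)=x$, $f(x,k)=e^{kx}$, $\epsilon=0.2$) are our own illustration and satisfy the lemma's hypotheses, including $\operatorname{esssup}(H(X))=1$.

```python
import numpy as np

rng = np.random.default_rng(2)

# X ~ Uniform[0, 1], H(x) = x (so esssup H(X) = 1), f(x, k) = exp(k x), eps = 0.2.
# The f-weighted mass below the truncation level 1 - eps should be negligible
# relative to the mass above 1 - eps/2 once N and k are large.
eps, N, k = 0.2, 200000, 60
X = rng.uniform(0.0, 1.0, size=N)
w = np.exp(k * (X - 1.0))                 # exp(k X), rescaled by exp(-k) for stability
num = np.sum(w * X * (X < 1 - eps))       # weighted mass with H(X) < 1 - eps
den = np.sum(w * X * (X > 1 - eps / 2))   # weighted mass with H(X) > 1 - eps/2
assert num / den < 1e-3                   # the ratio is already tiny at these N, k
```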

One key idea in the proof above is the use of a truncation level $\epsilon$ for $H(X)$ with $\operatorname{esssup}(H(X))=1$. By Lemma 8, we have

$$\lim_{N,k\to\infty}\frac{\sum_{i=1}^{N}f(X_{i},k)H(X_{i})\mathbbm{1}(H(X_{i})>1-\epsilon)}{\sum_{i=1}^{N}f(X_{i},k)H(X_{i})}=\lim_{N,k\to\infty}\frac{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})>1-\epsilon)}{\sum_{i=1}^{N}f(X_{i},k)}=1,$$

since

$$0\leq\frac{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})<1-\epsilon)}{\sum_{i=1}^{N}f(X_{i},k)}\leq\frac{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})<1-\epsilon)}{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})>1-\epsilon/2)}$$

holds with probability 1. Therefore, for any 0 < ϵ < 1 0 italic-ϵ 1 0<\epsilon<1 , we have

$$\begin{aligned}
\lim_{N,k\to\infty}\mathcal{L}(\theta,k,N)&=\mathbb{E}\lim_{N,k\to\infty}\left[\frac{\sum_{j\in[N]}h\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]})\big)f\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]}),k\big)}{\sum_{j\in[N]}f\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]}),k\big)}\right]\\
&=h\Big(\max_{\ell_{1},\dots,\ell_{T}}R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\Big)\\
&\quad\times\mathbb{E}\lim_{N,k\to\infty}\left[\frac{\sum_{j\in[N]}\frac{h(R_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]}))}{h(\max_{\ell_{1},\dots,\ell_{T}}R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]}))}f\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]}),k\big)\mathbbm{1}\Big(\frac{h(R_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]}))}{h(\max_{\ell_{1},\dots,\ell_{T}}R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]}))}>1-\epsilon\Big)}{\sum_{j\in[N]}f\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]}),k\big)\mathbbm{1}\Big(\frac{h(R_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]}))}{h(\max_{\ell_{1},\dots,\ell_{T}}R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]}))}>1-\epsilon\Big)}\right]\\
&\geq(1-\epsilon)\,h\Big(\max_{\ell_{1},\dots,\ell_{T}}R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\Big),
\end{aligned}$$

which implies $\lim_{N,k\to\infty}\mathcal{L}(\theta,k,N)=h\big(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\big)$, since

$$\mathcal{L}(\theta,k,N)\leq h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right)$$

by the definition of $\mathcal{L}$ and the fact that $h$ is non-decreasing, and by letting $\epsilon\to 0$ we obtain

$$\lim_{N,k\to\infty}\mathcal{L}(\theta,k,N)\geq h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right).$$

Here, we used the facts that $(\ell_{t})_{t\in[T]}$ has a continuous distribution, that $\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})$ is a continuous function, and the non-decreasing property and continuity of $h$ (Condition 1), which lead to:

$$\operatorname{esssup}\left(h\left(\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right)\right)=\max_{\ell_{1},\dots,\ell_{T}}h\left(\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right)=h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right).\tag{14}$$

Equation (14) will be used frequently in the remaining proofs of Section D.2.

Step 3. Proving $\lim_{k\to\infty}\lim_{N\to\infty}\mathcal{L}(\theta,k,N)=h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right)$.

Lastly, as $N\to\infty$, similarly by the dominated convergence theorem we have

$$\lim_{N\to\infty}\mathcal{L}(\theta,k,N)=\mathbb{E}\lim_{N\to\infty}\left[\frac{\sum_{j\in[N]}h\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]})\big)f\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]}),k\big)}{\sum_{j\in[N]}f\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]}),k\big)}\right]=\frac{\mathbb{E}\left[h\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(1)})_{t\in[T]})\big)f\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(1)})_{t\in[T]}),k\big)\right]}{\mathbb{E}\left[f\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(1)})_{t\in[T]}),k\big)\right]},$$

where the last equality follows from the strong law of large numbers, as the sequences $(\ell_{t}^{(j)})_{t\in[T]}$ are i.i.d. across $j$.

Thus, $\lim_{N\to\infty}\mathcal{L}(\theta,k,N)$ exists for every $k$. Now, applying the known property of the double iterated limit (Lemma 5), we obtain $\lim_{k\to\infty}\lim_{N\to\infty}\mathcal{L}(\theta,k,N)=h\big(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\big)$. ∎
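The mechanism behind Step 1 is a softmax-to-max limit: with, e.g., $f(R,k)=e^{kR}$, the $f$-weighted average of $h(R)$ over samples concentrates on the largest sampled value as $k\to\infty$. A small sketch with synthetic "regret" samples; the choices of $h$, $f$, and the sample law are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-ins for the sampled regrets R_j, with h(r) = r^2 and f(R, k) = exp(k R).
R = rng.uniform(0.0, 2.0, size=1000)
h = lambda r: r ** 2
k = 2000.0
w = np.exp(k * (R - R.max()))         # softmax weights, stabilized by subtracting the max
soft = np.sum(h(R) * w) / np.sum(w)   # the f-weighted average of h over the samples
assert abs(soft - h(R.max())) < 0.05  # close to h(max_j R_j), as in Step 1
```

As $N\to\infty$, $\max_{j}R_{j}$ in turn approaches the essential supremum, yielding $h(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]}))$.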

Claim 2 (Uniform convergence of $\mathcal{L}(\theta,k,N)$ with respect to $k$ and $N$).

$\mathcal{L}(\theta,k,N)$ converges uniformly to $h(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]}))$ on the domain $\Theta$.

Proof.

We first provide an analysis similar to that of Lemma 8:

Lemma 9 .

For any $0<\epsilon<1$, $0<\delta<1$, and $k\in\mathbb{N}^{+}$, we have

$$\frac{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})<1-\epsilon)}{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})>1-\epsilon)}=\widetilde{\mathcal{O}}\left(A(k,H,\epsilon)\left(\frac{1}{1-F_{H,X}(1-\epsilon/2)}+\frac{1}{\sqrt{N}}\right)\right)$$

with probability at least $1-\delta$, where the $X_{i}$'s are i.i.d. random variables with $\operatorname{esssup}(H(X_{i}))=1$, $H:\mathbb{R}\to\mathbb{R}^{+}$ is a continuous non-decreasing function, $A(k,t,\epsilon):=\frac{(1-\epsilon)f((t/\operatorname{esssup}(t(X)))^{-1}(1-\epsilon),k)}{(1-\epsilon/2)f((t/\operatorname{esssup}(t(X)))^{-1}(1-\epsilon/2),k)}$ for any non-decreasing function $t:\mathbb{R}\to\mathbb{R}^{+}$, and $F_{t,X}$ is the cumulative distribution function of the random variable $t(X)/\operatorname{esssup}(t(X))$.

Proof of Lemma 9 .

By the same argument as in the proof of Lemma 8, we have

$$\frac{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})<1-\epsilon)}{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})>1-\epsilon/2)}\leq\frac{f(H^{-1}(1-\epsilon),k)\,|\{i\in[N]\mid H(X_{i})<1-\epsilon\}|}{f(H^{-1}(1-\epsilon/2),k)\,|\{i\in[N]\mid H(X_{i})>1-\epsilon/2\}|}.$$

It holds that $\frac{1}{N}|\{i\in[N]\mid H(X_{i})<1-\epsilon\}|=F_{H,X}(1-\epsilon)+\widetilde{\mathcal{O}}(1/\sqrt{N})$ with probability at least $1-\delta/2$, due to Hoeffding's inequality (Lemma 6). Similarly, we have $\frac{1}{N}|\{i\in[N]\mid H(X_{i})>1-\epsilon/2\}|=1-F_{H,X}(1-\epsilon/2)+\widetilde{\mathcal{O}}(1/\sqrt{N})$ with probability at least $1-\delta/2$. Therefore,

$$\frac{|\{i\in[N]\mid H(X_{i})<1-\epsilon\}|}{|\{i\in[N]\mid H(X_{i})>1-\epsilon/2\}|}=\frac{F_{H,X}(1-\epsilon)}{1-F_{H,X}(1-\epsilon/2)}+\widetilde{\mathcal{O}}(\sqrt{1/N})\leq\frac{1}{1-F_{H,X}(1-\epsilon/2)}+\widetilde{\mathcal{O}}(\sqrt{1/N}),$$

with probability at least 1 δ 1 𝛿 1-\delta . Finally, we have

$$\frac{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})<1-\epsilon)}{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})>1-\epsilon)}<\frac{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})<1-\epsilon)}{\sum_{i=1}^{N}f(X_{i},k)\mathbbm{1}(H(X_{i})>1-\epsilon/2)}\leq A(k,H,\epsilon)\left(\frac{1}{1-F_{H,X}(1-\epsilon/2)}+\widetilde{\mathcal{O}}\Big(\frac{1}{\sqrt{N}}\Big)\right).$$

Note that $\lim_{k\to\infty}A(k,H,\epsilon)=0$, since $\lim_{k\to\infty}\frac{f(R_{1},k)}{f(R_{2},k)}=\infty\cdot\mathbbm{1}(R_{1}>R_{2})+\mathbbm{1}(R_{1}=R_{2})$. By Lemma 9 with $H(R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]}))=\frac{h(R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]}))}{h(\max_{\ell_{1},\dots,\ell_{T}}R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]}))}$, we have

$$\begin{aligned}
&\frac{\sum_{i=1}^{N}f\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(i)})_{t\in[T]}),k\big)\mathbbm{1}\Big(\frac{h(R_{\text{LLM}_{\theta}}((\ell_{t}^{(i)})_{t\in[T]}))}{h(\max_{\ell_{1},\dots,\ell_{T}}R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]}))}\geq 1-\epsilon\Big)}{\sum_{i=1}^{N}f\big(R_{\text{LLM}_{\theta}}((\ell_{t}^{(i)})_{t\in[T]}),k\big)}\\
&\quad=\frac{1}{1+\frac{\sum_{i=1}^{N}f(R_{\text{LLM}_{\theta}}((\ell_{t}^{(i)})_{t\in[T]}),k)\mathbbm{1}\big(\frac{h(R_{\text{LLM}_{\theta}}((\ell_{t}^{(i)})_{t\in[T]}))}{h(\max_{\ell_{1},\dots,\ell_{T}}R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]}))}<1-\epsilon\big)}{\sum_{i=1}^{N}f(R_{\text{LLM}_{\theta}}((\ell_{t}^{(i)})_{t\in[T]}),k)\mathbbm{1}\big(\frac{h(R_{\text{LLM}_{\theta}}((\ell_{t}^{(i)})_{t\in[T]}))}{h(\max_{\ell_{1},\dots,\ell_{T}}R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]}))}\geq 1-\epsilon\big)}}\geq\frac{1}{1+A(k,H,\epsilon)\Big(\frac{1}{1-F_{H,R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})}(1-\epsilon/2)}+\widetilde{\mathcal{O}}(\sqrt{1/N})\Big)},
\end{aligned}$$

where we recall the shorthand notation $R_{\text{LLM}_{\theta}}=\text{Regret}_{\text{LLM}_{\theta}}$. Note that $A(k,H,\epsilon)=A(k,h,\epsilon)$ and $F_{H,R_{\text{LLM}_{\theta}}}=F_{h,R_{\text{LLM}_{\theta}}}$ hold by the definitions of $F_{t,X}$ and $A(k,t,\epsilon)$ in Lemma 9. Therefore,

\begin{align*}
1&\geq\frac{\sum_{i=1}^{N}f(R_{\text{LLM}_{\theta}}((\ell_{t}^{(i)})_{t\in[T]}),k)\,\frac{h(R_{\text{LLM}_{\theta}}((\ell_{t}^{(i)})_{t\in[T]}))}{h(\max_{\ell_{1},\dots,\ell_{T}}R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]}))}}{\sum_{i=1}^{N}f(R_{\text{LLM}_{\theta}}((\ell_{t}^{(i)})_{t\in[T]}),k)}\\
&\geq\frac{\sum_{i=1}^{N}f(R_{\text{LLM}_{\theta}}((\ell_{t}^{(i)})_{t\in[T]}),k)\,\frac{h(R_{\text{LLM}_{\theta}}((\ell_{t}^{(i)})_{t\in[T]}))}{h(\max_{\ell_{1},\dots,\ell_{T}}R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]}))}\,\mathbbm{1}\left(\frac{h(R_{\text{LLM}_{\theta}}((\ell_{t}^{(i)})_{t\in[T]}))}{h(\max_{\ell_{1},\dots,\ell_{T}}R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]}))}\geq 1-\epsilon\right)}{\sum_{i=1}^{N}f(R_{\text{LLM}_{\theta}}((\ell_{t}^{(i)})_{t\in[T]}),k)\,\mathbbm{1}\left(\frac{h(R_{\text{LLM}_{\theta}}((\ell_{t}^{(i)})_{t\in[T]}))}{h(\max_{\ell_{1},\dots,\ell_{T}}R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]}))}\geq 1-\epsilon\right)}\\
&\qquad\times\frac{1}{1+A(k,h,\epsilon)\left(\frac{1}{1-F_{h,R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})}(1-\epsilon/2)}+\widetilde{\mathcal{O}}(\sqrt{1/N})\right)}\\
&\geq\frac{1-\epsilon}{1+A(k,h,\epsilon)\left(\frac{1}{1-F_{h,R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})}(1-\epsilon/2)}+\widetilde{\mathcal{O}}(\sqrt{1/N})\right)}
\end{align*}

with probability at least $1-\delta$.

Now, for any $\epsilon>0$ and $\delta>0$, we have

\begin{align*}
0&\leq h\left(\max_{\ell_{1},\dots,\ell_{T}}R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right)-\mathcal{L}(\theta,k,N)\\
&\leq h\left(\max_{\ell_{1},\dots,\ell_{T}}R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right)\left(1-\frac{(1-\delta)(1-\epsilon)}{1+A(k,h,\epsilon)\left(\frac{1}{1-F_{h,R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})}(1-\epsilon/2)}+\widetilde{\mathcal{O}}(\sqrt{1/N})\right)}\right).
\end{align*}

Note that

\begin{align*}
1-F_{h,R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})}(1-\epsilon/2)=\mathbb{P}\left(h\left(\text{Regret}_{\text{LLM}_{\theta}}\left((\ell_{t})_{t\in[T]}\right)\right)>(1-\epsilon/2)\,h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}\left((\ell_{t})_{t\in[T]}\right)\right)\right)
\end{align*}

is a continuous function of $\theta$, since we assume that $\text{LLM}_{\theta}$ is a continuous function of $\theta$, that $(\ell_{t})_{t\in[T]}$ has a continuous distribution, and that $\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})$ is a continuous function of $\text{LLM}_{\theta}$ and $(\ell_{t})_{t\in[T]}$. Since we consider a compact $\Theta$ (as in several recent works on analyzing Transformers (Bai et al., 2023; Lin et al., 2024)), we have $p(\epsilon):=\min_{\theta\in\Theta}1-F_{h,R_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})}(1-\epsilon/2)>0$. Therefore,

\begin{align}
\left(1-\frac{(1-\delta)(1-\epsilon)}{1+A(k,h,\epsilon)\left(\frac{1}{1-F_{h,R_{\text{LLM}_{\theta}}}(1-\epsilon/2)}+\widetilde{\mathcal{O}}(\sqrt{1/N})\right)}\right)\leq\left(1-\frac{(1-\delta)(1-\epsilon)}{1+A(k,h,\epsilon)\left(\frac{1}{p(\epsilon)}+\widetilde{\mathcal{O}}(\sqrt{1/N})\right)}\right),\tag{15}
\end{align}

and we know that $\lim_{N,k\to\infty}1+A(k,h,\epsilon)\left(\frac{1}{p(\epsilon)}+\widetilde{\mathcal{O}}(\sqrt{1/N})\right)=1$, which does not depend on $\theta$. Thus, we can conclude that $\lim_{N,k\to\infty}\sup_{\theta\in\Theta}\left|h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right)-\mathcal{L}(\theta,k,N)\right|=0$, since $\epsilon$ and $\delta$ can be chosen to be arbitrarily small. ∎

Claim 3 (Double iterated limit of supremum) .

It holds that:

\begin{align*}
\lim_{N\to\infty}\lim_{k\to\infty}\sup_{\theta\in\Theta}~\Big|\mathcal{L}(\theta,k,N)-h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right)\Big|=0.
\end{align*}
Proof.

Since $h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right)\geq\mathcal{L}(\theta,k,N)$, it suffices to prove

\begin{align*}
\lim_{N\to\infty}\lim_{k\to\infty}\sup_{\theta\in\Theta}~h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right)-\mathcal{L}(\theta,k,N)=0.
\end{align*}
Lemma 10 .

$\frac{\sum_{i=1}^{N}f(X_{i},k_{1})h(X_{i})}{\sum_{i=1}^{N}f(X_{i},k_{1})}\leq\frac{\sum_{i=1}^{N}f(X_{i},k_{2})h(X_{i})}{\sum_{i=1}^{N}f(X_{i},k_{2})}$ holds if $0<k_{1}\leq k_{2}$, for any real-valued $(X_{i})_{i\in[N]}$.

Proof.

Multiplying both sides of the inequality by $\left(\sum_{i=1}^{N}f(X_{i},k_{1})\right)\left(\sum_{i=1}^{N}f(X_{i},k_{2})\right)$, we see that it is equivalent to $\sum_{1\leq i\neq j\leq N}f(X_{i},k_{1})h(X_{i})f(X_{j},k_{2})\leq\sum_{1\leq i\neq j\leq N}f(X_{i},k_{1})h(X_{j})f(X_{j},k_{2})$. This is in turn equivalent to

\begin{align*}
\sum_{1\leq i\neq j\leq N}\big(f(X_{i},k_{1})f(X_{j},k_{2})-f(X_{j},k_{1})f(X_{i},k_{2})\big)\big(h(X_{i})-h(X_{j})\big)\leq 0,
\end{align*}

which is true since, whenever $X_{i}\geq X_{j}$, we have $h(X_{i})-h(X_{j})\geq 0$ (as $h$ is non-decreasing) and $f(X_{i},k_{1})f(X_{j},k_{2})-f(X_{j},k_{1})f(X_{i},k_{2})\leq 0$ due to the log-increasing difference of $f$ (1), as $\log f(X_{j},k_{1})-\log f(X_{j},k_{2})\geq\log f(X_{i},k_{1})-\log f(X_{i},k_{2})$ if $X_{i}\geq X_{j}$. ∎
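The monotonicity in Lemma 10 can be sanity-checked numerically. The sketch below is purely illustrative: $f(x,k)=e^{kx}$ (which satisfies the log-increasing-difference condition, since $\log f(x,k_1)-\log f(x,k_2)=(k_1-k_2)x$ is non-increasing in $x$ for $k_1\leq k_2$) and $h(x)=\max(x,0)$ are assumed choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_avg(X, k, h):
    # f(x, k) = exp(k * x); shift by max(X) for numerical stability
    # (the shift cancels in the ratio and does not change the average)
    w = np.exp(k * (X - X.max()))
    return np.sum(w * h(X)) / np.sum(w)

X = rng.uniform(-1.0, 1.0, size=50)
h = lambda x: np.maximum(x, 0.0)  # an illustrative non-decreasing h
vals = [weighted_avg(X, k, h) for k in (0.5, 1.0, 2.0, 4.0, 8.0, 200.0)]

# Lemma 10: the f-weighted average of h(X_i) is non-decreasing in k ...
assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:]))
# ... and it approaches h(max_i X_i) as k grows, the limit used later
assert abs(vals[-1] - h(X.max())) < 1e-2
```

This also illustrates why $\mathcal{L}(\theta,k,N)$ interpolates between an average and a maximum of the (transformed) regrets as $k$ grows.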

Therefore, $\mathcal{L}(\theta,k,N)$ is a non-decreasing function of $k$ when $N$ is fixed, which indicates that

\begin{align*}
\lim_{k\to\infty}~\sup_{\theta\in\Theta}~h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right)-\mathcal{L}(\theta,k,N)
\end{align*}

exists, as $\mathcal{L}(\theta,k,N)$ is also bounded. Therefore, by Lemmas 5 and 2, we know that

\begin{align*}
\lim_{N\to\infty}\lim_{k\to\infty}~\sup_{\theta\in\Theta}~\Big|\mathcal{L}(\theta,k,N)-h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right)\Big|
\end{align*}

exists, and this value must be 0. ∎

Claim 4 .

It holds that

\begin{align*}
\lim_{N,k\to\infty}\inf_{\theta\in\Theta}\mathcal{L}(\theta,k,N)=\lim_{N\to\infty}\lim_{k\to\infty}\inf_{\theta\in\Theta}\mathcal{L}(\theta,k,N)=\inf_{\theta\in\Theta}h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right).
\end{align*}
Proof.

Firstly, by Lemma 7, we have $\lim_{N,k\to\infty}\inf_{\theta\in\Theta}\mathcal{L}(\theta,k,N)=\inf_{\theta\in\Theta}h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right)$. Moreover, since $\mathcal{L}(\theta,k,N)$ is a monotonically non-decreasing function of $k$ for any fixed $N$ (Lemma 10) and is bounded, $\lim_{k\to\infty}\inf_{\theta\in\Theta}\mathcal{L}(\theta,k,N)$ always exists. Therefore, by Lemma 5, we also have $\lim_{N\to\infty}\lim_{k\to\infty}\inf_{\theta\in\Theta}\mathcal{L}(\theta,k,N)=\inf_{\theta\in\Theta}h\left(\max_{\ell_{1},\dots,\ell_{T}}\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right)$. ∎

D.3 Deferred Proofs of Theorem 2 and Corollary 1

See 2

Before proving the theorem, we remark on which LLM architectures enjoy Lipschitz-continuity. We provide two auxiliary results in the following proposition. The first result, from (Bai et al., 2023, Section J.1), concerns the Lipschitzness of Transformers. The second result concerns the post-processing of the Transformer output: the output is usually not used directly, but is first multiplied by some matrix $A$ and then passed through some operator Operator (to be specified later).

Proposition 2 .

The $L$-layer Transformer $\texttt{TF}_{\theta}$ as defined in Section A.1 is $C_{\texttt{TF}}$-Lipschitz continuous with respect to $\theta$, with $C_{\texttt{TF}}:=L\left((1+B_{\texttt{TF}}^{2})(1+B_{\texttt{TF}}^{2}R^{3})\right)^{L}B_{\texttt{TF}}R(1+B_{\texttt{TF}}R^{2}+B_{\texttt{TF}}^{3}R^{2})$, i.e.,

\begin{align*}
\|\texttt{TF}_{\theta_{1}}(Z)-\texttt{TF}_{\theta_{2}}(Z)\|_{2,\infty}\leq C_{\texttt{TF}}\|\theta_{1}-\theta_{2}\|_{\texttt{TF}},
\end{align*}

where $\|\cdot\|_{\texttt{TF}}$ is as defined in Equation 9, and $R,Z,B_{\texttt{TF}}$ are as introduced in Section A.1. Moreover, the function $\texttt{Operator}(A\cdot\texttt{TF}_{\theta}(\cdot)_{-1})$ is $\|A\|_{\texttt{op}}C_{\texttt{TF}}$-Lipschitz continuous with respect to $\theta$, i.e.,

\begin{align*}
\|\texttt{Operator}(A\cdot\texttt{TF}_{\theta_{1}}(Z)_{-1})-\texttt{Operator}(A\cdot\texttt{TF}_{\theta_{2}}(Z)_{-1})\|_{2}\leq\|A\|_{\texttt{op}}C_{\texttt{TF}}\|\theta_{1}-\theta_{2}\|_{\texttt{TF}}.
\end{align*}

Here, Operator is either the projection operator onto some convex set, or the Softmax function.

Proof.

The first result is from (Bai et al., 2023 , Section J.1) . The second result comes from

  • If Operator is a projection onto some convex set, then $\|\texttt{Operator}(x)-\texttt{Operator}(y)\|_{2}\leq\|x-y\|_{2}$;

  • If Operator is Softmax, then $\|\texttt{Softmax}(x)-\texttt{Softmax}(y)\|_{2}\leq\|x-y\|_{2}$ (Gao and Pavel, 2017, Corollary 3).

Note that the only condition that we require for Operator is its non-expansiveness. ∎
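The non-expansiveness of both choices of Operator can be checked numerically; the sketch below is illustrative only, with the box $[-1,1]^{d}$ standing in for a generic convex set (projection onto a box is coordinate-wise clipping):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())  # shift by max(z) for numerical stability
    return e / e.sum()

def project_box(z, lo=-1.0, hi=1.0):
    # Euclidean projection onto the convex set [lo, hi]^d
    return np.clip(z, lo, hi)

# non-expansiveness: ||Operator(x) - Operator(y)||_2 <= ||x - y||_2
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    gap = np.linalg.norm(x - y)
    assert np.linalg.norm(softmax(x) - softmax(y)) <= gap + 1e-12
    assert np.linalg.norm(project_box(x) - project_box(y)) <= gap + 1e-12
```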

Proof of Theorem 2 .

Let $C_{\text{LLM}}$ denote the Lipschitz-continuity constant of $\text{LLM}_{\theta}$ with respect to some norm $\|\cdot\|_{\text{LLM}}$ defined on the parameter space of the LLM (e.g., the norm $\|\cdot\|_{\texttt{TF}}$ in Proposition 2). We now prove that the regret is also a Lipschitz-continuous function of the LLM's parameters.

Lemma 11 (Lipschitzness of regret) .

The function $\text{Regret}_{\text{LLM}_{\theta}}$ is $C_{\texttt{Reg}}:=BC_{\text{LLM}}T$-Lipschitz continuous with respect to $\theta$, i.e.,

\begin{align*}
\Big|\text{Regret}_{\text{LLM}_{\theta_{1}}}((\ell_{t})_{t\in[T]})-\text{Regret}_{\text{LLM}_{\theta_{2}}}((\ell_{t})_{t\in[T]})\Big|\leq C_{\texttt{Reg}}\|\theta_{1}-\theta_{2}\|_{\text{LLM}}.
\end{align*}
Proof.

By definition, we have

\begin{align*}
\Big|\text{Regret}_{\text{LLM}_{\theta_{1}}}((\ell_{t})_{t\in[T]})-\text{Regret}_{\text{LLM}_{\theta_{2}}}((\ell_{t})_{t\in[T]})\Big|&=\left|\sum_{t=1}^{T}\langle\ell_{t},\text{LLM}_{\theta_{1}}(Z_{t-1})-\text{LLM}_{\theta_{2}}(Z_{t-1})\rangle\right|\\
&\leq B\sum_{t=1}^{T}\|\text{LLM}_{\theta_{1}}(Z_{t-1})-\text{LLM}_{\theta_{2}}(Z_{t-1})\|\qquad\text{(by Cauchy--Schwarz and $\|\ell_{t}\|\leq B$)}\\
&\leq BC_{\text{LLM}}T\|\theta_{1}-\theta_{2}\|_{\text{LLM}},
\end{align*}

where $Z_{t}:=(\ell_{1},\dots,\ell_{t},c)$ for all $t\in[T]$, and $Z_{0}=(c)$, where $c$ is a $d$-dimensional vector. ∎
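The shape of Lemma 11 can be checked numerically on a toy model. Everything below is an assumed illustration: a softmax policy over the running average of past losses stands in for $\text{LLM}_{\theta}$, the parameter is a single matrix $W$, and the bound $\|\ell_{t}\|_{2}\leq B\sqrt{d}$, $C_{\text{LLM}}\leq B\sqrt{d}$ (softmax is non-expansive and the history vector has entries in $[0,B]$) replaces the abstract constants:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, B = 3, 20, 1.0  # action dimension, horizon, loss bound (illustrative)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def regret(W, losses):
    # toy stand-in for LLM_theta: policy pi_t = softmax(W @ (avg. past loss))
    total, hist = 0.0, np.zeros(d)
    for t, l in enumerate(losses):
        total += l @ softmax(W @ hist)
        hist = (hist * t + l) / (t + 1)
    return total - losses.sum(axis=0).min()  # subtract best fixed action

losses = rng.uniform(0.0, B, size=(T, d))
W1 = rng.normal(size=(d, d))
W2 = W1 + 1e-3 * rng.normal(size=(d, d))

# Lemma-11-style bound: |R(W1) - R(W2)| <= B' * C_LLM * T * ||W1 - W2||,
# with B' = C_LLM = B * sqrt(d) and the spectral norm as ||.||_LLM here
C = (B * np.sqrt(d)) ** 2 * T * np.linalg.norm(W1 - W2, 2)
assert abs(regret(W1, losses) - regret(W2, losses)) <= C
```

The comparator term cancels in the difference of regrets, which is why only the $T$ inner products contribute, exactly as in the display above.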

Now, we will prove the Lipschitzness of

\begin{align}
C\left((\ell_{t}^{(j)})_{t\in[T],j\in[N]},k,\theta\right):=\frac{\sum_{j\in[N]}h(\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]}))\,f(\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]}),k)}{\sum_{j\in[N]}f(\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t}^{(j)})_{t\in[T]}),k)}\tag{16}
\end{align}

with respect to the model parameter $\theta$.

Claim 5 .

For any $R>0$, there exists $\beta_{R}>0$ such that if $\beta>\beta_{R}$, we have

\begin{align*}
\left|\frac{\sum_{n\in[N]}x_{n}f(x_{n},\beta)}{\sum_{n\in[N]}f(x_{n},\beta)}-\frac{\sum_{n\in[N]}y_{n}f(y_{n},\beta)}{\sum_{n\in[N]}f(y_{n},\beta)}\right|\leq 2\|x-y\|_{\infty}
\end{align*}

for every $x,y\in\mathbb{R}^{N}$ such that $|x_{i}|\leq R$ and $|y_{i}|\leq R$ for all $i\in[N]$.

Proof.

In the limit $\beta\to\infty$, we have that

\begin{align*}
\lim_{\beta\to\infty}\left(\left|\frac{\sum_{n\in[N]}x_{n}f(x_{n},\beta)}{\sum_{n\in[N]}f(x_{n},\beta)}-\frac{\sum_{n\in[N]}y_{n}f(y_{n},\beta)}{\sum_{n\in[N]}f(y_{n},\beta)}\right|\Big{/}\|x-y\|_{\infty}\right)=\frac{|\max_{n\in[N]}x_{n}-\max_{n\in[N]}y_{n}|}{\|x-y\|_{\infty}}\leq 1
\end{align*}

holds. Moreover, consider the following constrained optimization problem:

\begin{align*}
\max_{x,y\in\mathbb{R}^{N}}&\quad\left(\left|\frac{\sum_{n\in[N]}x_{n}f(x_{n},\beta)}{\sum_{n\in[N]}f(x_{n},\beta)}-\frac{\sum_{n\in[N]}y_{n}f(y_{n},\beta)}{\sum_{n\in[N]}f(y_{n},\beta)}\right|\Big{/}\|x-y\|_{\infty}\right)\\
&\qquad\text{subject to }\quad|x_{i}|\leq R,~|y_{i}|\leq R~\text{ for all }i\in[N],
\end{align*}

whose optimal value is denoted by $F(R,\beta)$. Then, since $\{x:\|x\|_{\infty}\leq R\}\times\{y:\|y\|_{\infty}\leq R\}$ is a compact set, by Berge's Maximum Theorem (Berge, 1877), $F(R,\beta)$ is a continuous function of $\beta$. Moreover, we know that $F(R,\infty)\leq 1$, which indicates that we can find a large enough $\beta_{R}$ such that $F(R,\beta)\leq 2$ whenever $\beta>\beta_{R}$. ∎
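Claim 5 can likewise be probed numerically. The sketch below is illustrative only: $f(x,\beta)=e^{\beta x}$ is an assumed choice of $f$, and $\beta=50$ is assumed to exceed the (unspecified) threshold $\beta_{R}$ for $R=1$:

```python
import numpy as np

rng = np.random.default_rng(3)

def soft_weighted_avg(x, beta):
    # f(x, beta) = exp(beta * x): the weighted average tends to max_n x_n
    w = np.exp(beta * (x - x.max()))  # shift by max(x) for numerical stability
    return np.sum(w * x) / np.sum(w)

R, beta = 1.0, 50.0  # beta assumed larger than the threshold beta_R for R = 1
for _ in range(1000):
    x = rng.uniform(-R, R, size=8)
    y = rng.uniform(-R, R, size=8)
    lhs = abs(soft_weighted_avg(x, beta) - soft_weighted_avg(y, beta))
    assert lhs <= 2 * np.abs(x - y).max() + 1e-9  # Claim 5 with N = 8
```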

Note that Claim 5 does not hold if either $x_{i}$ or $y_{i}$ is unbounded. Now, we apply Claim 5 to Equation 16. We can guarantee that $\left|\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})\right|\leq\text{diam}(\Pi,\|\cdot\|_{2})TB$.

Also, note that the domain of $h:\mathbb{R}\to\mathbb{R}^{+}$ is effectively constrained to the range of values that $\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})$ can achieve, which means that we can regard $h$ as $h:[-\text{diam}(\Pi,\|\cdot\|_{2})TB,\,\text{diam}(\Pi,\|\cdot\|_{2})TB]\to\mathbb{R}^{+}$. Due to the continuity of $h^{\prime}$, and the fact that $h$ now has a compact domain, $h$ is $C_{h}$-Lipschitz continuous for some $C_{h}>0$ on the interval $[-\text{diam}(\Pi,\|\cdot\|_{2})TB,\,\text{diam}(\Pi,\|\cdot\|_{2})TB]$.

Lemma 12 (Lipschitzness of C 𝐶 C in Equation 16 ) .

The function $C$ in Equation 16 is $C_{\texttt{cost}}:=2C_{h}C_{\text{Reg}}$-Lipschitz continuous with respect to $\theta$, provided that $k>k_{\text{diam}(\Pi,\|\cdot\|_{2})TB}$ for some threshold $k_{\text{diam}(\Pi,\|\cdot\|_{2})TB}>0$, i.e.,

\begin{align*}
\Big|C\left((\ell_{t}^{(j)})_{t\in[T],j\in[N]},k,\theta_{1}\right)-C\left((\ell_{t}^{(j)})_{t\in[T],j\in[N]},k,\theta_{2}\right)\Big|\leq C_{\texttt{cost}}\|\theta_{1}-\theta_{2}\|_{\text{LLM}}.
\end{align*}
Proof.
\begin{align*}
\big|C((\ell_{t}^{(j)})_{t\in[T],j\in[N]},k,\theta_{1})&-C((\ell_{t}^{(j)})_{t\in[T],j\in[N]},k,\theta_{2})\big|\\
&\underset{(i)}{\leq}2\|h(\text{Regret}_{\text{LLM}_{\theta_{1}}}((\ell_{t}^{(j)})_{t\in[T]}))-h(\text{Regret}_{\text{LLM}_{\theta_{2}}}((\ell_{t}^{(j)})_{t\in[T]}))\|_{\infty}\\
&\underset{(ii)}{\leq}2C_{h}\|\text{Regret}_{\text{LLM}_{\theta_{1}}}((\ell_{t}^{(j)})_{t\in[T]})-\text{Regret}_{\text{LLM}_{\theta_{2}}}((\ell_{t}^{(j)})_{t\in[T]})\|_{\infty}\\
&\underset{(iii)}{\leq}2C_{h}C_{\text{Reg}}\|\theta_{1}-\theta_{2}\|_{\text{LLM}}=C_{\texttt{cost}}\|\theta_{1}-\theta_{2}\|_{\text{LLM}}.
\end{align*}

Here, (i) holds due to Claim 5, (ii) holds since $h$ is $C_{h}$-Lipschitz continuous on the range of $\text{Regret}_{\text{LLM}_{\theta}}((\ell_{t})_{t\in[T]})$, and (iii) holds due to Lemma 11. ∎

For completeness of the paper, we provide the definitions of a covering set and the covering number.

Definition 4 (Covering set and covering number) .

For $\delta>0$, a metric space $(X,\|\cdot\|)$, and a subset $Y\subseteq X$, a set $C\subset Y$ is a $\delta$-covering of $Y$ if $Y\subseteq\cup_{c\in C}B(c,\delta,\|\cdot\|)$ holds. The $\delta$-covering number $N(\delta;Y,\|\cdot\|)$ is defined as the minimum cardinality of any such covering set.

By (Wainwright, 2019, Example 5.8), for any $r>0$, we can verify that the $\delta$-covering number $N(\delta;B(0,r,\|\cdot\|_{\text{LLM}}),\|\cdot\|_{\text{LLM}})$ can be bounded as

\begin{align*}
\log N(\delta;B(0,r,\|\cdot\|_{\text{LLM}}),\|\cdot\|_{\text{LLM}})\leq d_{\theta}\log(1+2r/\delta),
\end{align*}

where $d_{\theta}$ is the dimension of the LLM's entire parameter. For example, if we use $\|\cdot\|_{\texttt{TF}}$ and consider the Transformer model as defined in Section A.1, then for any $r>0$,

\[
\log N(\delta;B(0,r,\|\cdot\|_{\text{LLM}}),\|\cdot\|_{\text{LLM}})\leq L\big(3Md^2+2d(dd'+3md^2)\big)\log(1+2r/\delta).
\]
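As an illustrative numerical sanity check of this volumetric bound (a minimal sketch in the Euclidean norm in low dimension; the greedy construction `greedy_delta_net` and all parameters are our own choices, not from the paper): a maximal $\delta$-separated subset of a ball is simultaneously a $\delta$-covering and a $\delta$-packing, so its cardinality obeys the $(1+2r/\delta)^d$ bound.

```python
import numpy as np

def greedy_delta_net(points, delta):
    """Greedily build a maximal delta-separated subset of `points`.

    A maximal delta-separated set is simultaneously a delta-covering of
    `points` and a delta-packing, so its size obeys the volumetric bound."""
    net = []
    for p in points:
        if all(np.linalg.norm(p - c) > delta for c in net):
            net.append(p)
    return net

rng = np.random.default_rng(0)
d, r, delta = 2, 1.0, 0.25
# Dense sample of the Euclidean ball B(0, r).
pts = rng.uniform(-r, r, size=(4000, d))
pts = pts[np.linalg.norm(pts, axis=1) <= r]
net = greedy_delta_net(pts, delta)
# The bound log N <= d * log(1 + 2r/delta) from the text.
bound = d * np.log(1 + 2 * r / delta)
assert np.log(len(net)) <= bound
print(len(net), "centers; bound allows up to", int(np.exp(bound)))
```

The same packing argument underlies the $d_\theta\log(1+2r/\delta)$ bound above, with $d$ replaced by the parameter dimension.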

Since we consider a compact $\Theta$ (as in several recent works analyzing Transformers (Bai et al., 2023; Lin et al., 2024)), let $R_\Theta:=\text{diam}(\Theta,\|\cdot\|_{\text{LLM}})$ (which corresponds to $B_{\texttt{TF}}$ for the Transformer models as defined in Section A.1, with $\|\cdot\|_{\text{LLM}}=\|\cdot\|_{\texttt{TF}}$). Then there exists a set $\Theta_0$ with $\log|\Theta_0|=d_\theta\log(1+2R_\Theta/\delta)$ such that for any $\theta\in\Theta$, there exists a $\theta_0\in\Theta_0$ with

\[
\Big|C\big((\ell_t^{(j)})_{t\in[T],j\in[N]},k,\theta\big)-C\big((\ell_t^{(j)})_{t\in[T],j\in[N]},k,\theta_0\big)\Big|\leq C_{\texttt{cost}}\delta.
\]

Then, by the standard result from statistical learning theory (Wainwright, 2019, Chapter 5), when trained with $N_T$ samples, for every $0<\epsilon<1/2$, with probability at least $1-\epsilon$, we have

\[
\mathcal{L}(\widehat{\theta}_{k,N,N_T},k,N)-\inf_{\theta\in\Theta}\mathcal{L}(\theta,k,N)\leq\sqrt{\frac{2(\log|\Theta_0|+\log(2/\epsilon))}{N_T}}+2C_{\texttt{cost}}\delta.
\]

Setting $\delta=\Omega(\sqrt{\log(1/\epsilon)/N_T})$, we further obtain

\[
\mathcal{L}(\widehat{\theta}_{k,N,N_T},k,N)-\inf_{\theta\in\Theta}\mathcal{L}(\theta,k,N)\leq\widetilde{\mathcal{O}}\left(\sqrt{\frac{\log|\Theta_0|+\log(1/\epsilon)}{N_T}}\right)
\]

with probability at least $1-\epsilon$, completing the proof. ∎
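The finite-class bound above can be illustrated with a small simulation (a hedged sketch: the Bernoulli losses, class size, and seed are our own synthetic choices, not from the paper). With $|\Theta_0|$ candidate parameters and $N_T$ i.i.d. bounded losses per candidate, the excess population risk of the empirical minimizer stays below $\sqrt{2(\log|\Theta_0|+\log(2/\epsilon))/N_T}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n_theta, N_T, eps = 50, 2000, 0.05
# Synthetic bounded losses in [0, 1]: each candidate theta has a fixed mean loss.
means = rng.uniform(0.2, 0.8, size=n_theta)
samples = rng.uniform(0, 1, size=(N_T, n_theta)) < means  # Bernoulli(means)
emp = samples.mean(axis=0)                # empirical risks
theta_hat = int(np.argmin(emp))           # empirical risk minimizer
gap = means[theta_hat] - means.min()      # excess population risk
bound = np.sqrt(2 * (np.log(n_theta) + np.log(2 / eps)) / N_T)
assert 0 <= gap <= bound
print("excess risk", gap, "<= bound", bound)
```

The gap is typically far below the bound, which is the worst case over the uniform-convergence event of probability $1-\epsilon$.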

See 1

Proof.

The limit on the right-hand side of Equation 5 remains $\widetilde{\mathcal{O}}\left(\sqrt{\frac{d_\theta+\log(1/\epsilon)}{N_T}}\right)$, since we first take $\lim_{k\to\infty}$ and then take $\lim_{N\to\infty}$, thanks to the fact that Theorem 2 holds for large enough $k$ and any $N$. Next, we have

\begin{align*}
&\lim_{N\to\infty}\lim_{k\to\infty}\bigg|\mathcal{L}(\widehat{\theta}_{k,N,N_T},k,N)-h\left(\lim_{N\to\infty}\lim_{k\to\infty}\max_{\|\ell_t\|_\infty\leq B}\text{Regret}_{\text{LLM}_{\widehat{\theta}_{k,N,N_T}}}((\ell_t)_{t\in[T]})\right)\bigg|\\
&\quad\leq\lim_{N\to\infty}\lim_{k\to\infty}\bigg|\mathcal{L}(\widehat{\theta}_{k,N,N_T},k,N)-h\left(\max_{\|\ell_t\|_\infty\leq B}\text{Regret}_{\text{LLM}_{\widehat{\theta}_{k,N,N_T}}}((\ell_t)_{t\in[T]})\right)\bigg|\\
&\quad\qquad+\lim_{N\to\infty}\lim_{k\to\infty}\bigg|h\left(\max_{\|\ell_t\|_\infty\leq B}\text{Regret}_{\text{LLM}_{\widehat{\theta}_{k,N,N_T}}}((\ell_t)_{t\in[T]})\right)-h\left(\lim_{N\to\infty}\lim_{k\to\infty}\max_{\|\ell_t\|_\infty\leq B}\text{Regret}_{\text{LLM}_{\widehat{\theta}_{k,N,N_T}}}((\ell_t)_{t\in[T]})\right)\bigg|\\
&\quad\leq\lim_{N\to\infty}\lim_{k\to\infty}\sup_{\theta\in\Theta}\bigg|\mathcal{L}(\theta,k,N)-h\left(\max_{\|\ell_t\|_\infty\leq B}\text{Regret}_{\text{LLM}_{\theta}}((\ell_t)_{t\in[T]})\right)\bigg|+0=0,
\end{align*}

due to the continuity of $h$ and 3. Finally, we have

\[
\lim_{N\to\infty}\lim_{k\to\infty}\inf_{\theta\in\Theta}\mathcal{L}(\theta,k,N)=\inf_{\theta\in\Theta}h\left(\max_{\ell_1,\dots,\ell_T}\text{Regret}_{\text{LLM}_\theta}((\ell_t)_{t\in[T]})\right)
\]

due to 4, which, combined with the fact that $h$ is non-decreasing, completes the proof. ∎

Remark 5 (Dynamic-regret loss).

So far, we have focused on the canonical online learning setting with regret as the metric. One can also generalize the results to the non-stationary setting, with dynamic regret as the metric. Specifically, one can define the dynamic-regret-loss function as follows:

\[
\mathcal{L}(\theta,k,N):=\mathbb{E}\left[\frac{\sum_{j\in[N]}h\big(\text{D-Regret}_{\text{LLM}_\theta}((\ell_t^{(j)})_{t\in[T]})\big)\,f\big(\text{D-Regret}_{\text{LLM}_\theta}((\ell_t^{(j)})_{t\in[T]}),k\big)}{\sum_{j\in[N]}f\big(\text{D-Regret}_{\text{LLM}_\theta}((\ell_t^{(j)})_{t\in[T]}),k\big)}\right].
\]

Then, one can establish results similar to those above, since the analysis does not use any property of the regret other than its boundedness, together with the Lipschitz continuity of the LLM with respect to $\theta$. To be specific, Lemma 11 holds because the difference of the regrets can be bounded by the term

\[
\left|\sum_{t=1}^{T}\big\langle\ell_t,\text{LLM}_{\theta_1}(Z_{t-1})-\text{LLM}_{\theta_2}(Z_{t-1})\big\rangle\right|,
\]

as well as the fact that the term $\inf_{\pi_i\in\Pi}\langle\ell_i,\pi_i\rangle$ is canceled. One can verify that all the arguments in Section D.2 also hold for similar reasons.
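To make the dynamic-regret metric concrete, here is a minimal sketch (our own illustration; the uniform policy, loss distribution, and helper name `dynamic_regret` are arbitrary choices, not from the paper) that computes the dynamic regret of a fixed policy against the per-round best action, and checks that it upper-bounds the static external regret:

```python
import numpy as np

def dynamic_regret(losses, policies):
    """Dynamic regret: incurred loss minus the per-round best action's loss.

    losses: (T, d) array of loss vectors; policies: (T, d) simplex points played."""
    incurred = np.sum(losses * policies)
    comparator = np.sum(losses.min(axis=1))  # best action re-chosen every round
    return incurred - comparator

rng = np.random.default_rng(2)
T, d = 100, 3
losses = rng.uniform(0, 1, size=(T, d))
uniform = np.full((T, d), 1.0 / d)           # play the uniform policy each round
dr = dynamic_regret(losses, uniform)
# Static external regret compares against the best *fixed* action in hindsight,
# so dynamic regret always upper-bounds it.
static = np.sum(losses * uniform) - losses.sum(axis=0).min()
assert dr >= static >= 0 or dr >= static     # dr >= static holds deterministically
print("dynamic", dr, ">= static", static)
```

The inequality holds for any loss sequence, since the per-round minimum sums to at most the minimum of the sums.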

D.4 Deferred Proof of Theorem 3

See 3

Proof.

We will locally use $\mathcal{A}=[d]$ without loss of generality, since $\mathcal{A}$ is finite with $|\mathcal{A}|=d$, and will use $\ell_i(j)$ and $\ell_{ij}$ interchangeably for notational convenience. Define $a:=K^\intercal(Qc+q_c)\in\mathbb{R}^d$ and $b_{t-1}:=\beta\boldsymbol{1}_{t-1}:=k_c^\intercal(Qc+q_c)\boldsymbol{1}_{t-1}\in\mathbb{R}^{t-1}$. With $N=1$, $h(x)=x^2$, and the choice of $\Pi$, the loss function (Equation 3) can be written as follows:

\[
f(V,a,(b_t)_{t\in[T-1]},v_c):=\mathbb{E}\left(\sum_{t=1}^{T}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)\texttt{Softmax}(\ell_{1:t-1}^\intercal a+b_{t-1})+R_\Pi\Big\|\sum_{t=1}^{T}\ell_t\Big\|_2\right)^2,
\]

where for $t=1$, we use the output of the single-layer self-attention as $v_c$, and we will write it as $(V\ell_{1:0}+v_c\boldsymbol{1}_0^\intercal)\texttt{Softmax}(\ell_{1:0}^\intercal a+b_0)$ for notational consistency with $t\geq 2$. Also, we define the empty sum $\sum_{i=1}^{0}a_i=0$ for any sequence $(a_i)_{i\in\mathbb{N}^+}$.

Step 1. Calculating $\frac{\partial f}{\partial a}$.

For $x\in[d]$, we calculate the corresponding directional derivative with the following equation for $t\geq 2$:

\begin{align*}
&\frac{\partial}{\partial a_x}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)\texttt{Softmax}(\ell_{1:t-1}^\intercal a+b_{t-1})\\
&=\frac{\partial}{\partial a_x}\sum_{i=1}^{t-1}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)e_i\frac{\exp(e_i^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1}))}{\sum_{s=1}^{t-1}\exp(e_s^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1}))}\\
&=\frac{\sum_{i=1}^{t-1}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)e_i\exp(e_i^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1}))\frac{\partial e_i^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1})}{\partial a_x}\left(\sum_{s=1}^{t-1}\exp(e_s^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1}))\right)}{\left(\sum_{s=1}^{t-1}\exp(e_s^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1}))\right)^2}\\
&\qquad-\frac{\sum_{i=1}^{t-1}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)e_i\exp(e_i^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1}))\left(\sum_{s=1}^{t-1}\exp(e_s^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1}))\frac{\partial e_s^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1})}{\partial a_x}\right)}{\left(\sum_{s=1}^{t-1}\exp(e_s^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1}))\right)^2}.
\end{align*}

Plugging in $a=\boldsymbol{0}_d$, $v_c=\boldsymbol{0}_d$, and $(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}$ provides

\begin{align*}
&\frac{\partial}{\partial a_x}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)\texttt{Softmax}(\ell_{1:t-1}^\intercal a+b_{t-1})\bigg|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}\\
&=\sum_{i=1}^{t-1}\frac{\ell_t^\intercal V\ell_i\,\ell_{ix}}{t-1}-\sum_{i=1}^{t-1}\frac{\ell_t^\intercal V\ell_i\left(\sum_{s=1}^{t-1}\ell_{sx}\right)}{(t-1)^2}.
\end{align*}

For $t=1$, since $\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)\texttt{Softmax}(\ell_{1:t-1}^\intercal a+b_{t-1})=\ell_1^\intercal v_c$, we have $\frac{\partial}{\partial a_x}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)\texttt{Softmax}(\ell_{1:t-1}^\intercal a+b_{t-1})\big|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}=0$, so we can use the same formula as for $t\geq 2$ with the empty sum $\sum_{i=1}^{t-1}$. Using the above calculation, we can further compute $\frac{\partial f}{\partial a_x}\big|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}$ as follows:

\begin{align}
&\frac{\partial f(V,a,(b_t)_{t\in[T-1]},v_c)}{\partial a_x}\bigg|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}\nonumber\\
&=\mathbb{E}\,\frac{\partial}{\partial a_x}\left(\sum_{t=1}^{T}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)\texttt{Softmax}(\ell_{1:t-1}^\intercal a+b_{t-1})+R_\Pi\Big\|\sum_{t=1}^{T}\ell_t\Big\|_2\right)^2\bigg|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}\nonumber\\
&=\mathbb{E}\Bigg[\left(\sum_{t=1}^{T}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)\texttt{Softmax}(\ell_{1:t-1}^\intercal a+b_{t-1})+R_\Pi\Big\|\sum_{t=1}^{T}\ell_t\Big\|_2\right)\bigg|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}\nonumber\\
&\qquad\cdot\frac{\partial}{\partial a_x}\left(\sum_{t=1}^{T}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)\texttt{Softmax}(\ell_{1:t-1}^\intercal a+b_{t-1})+R_\Pi\Big\|\sum_{t=1}^{T}\ell_t\Big\|_2\right)\bigg|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}\Bigg]\nonumber\\
&=\mathbb{E}\Bigg[\left(\sum_{t=1}^{T}\ell_t^\intercal V\sum_{i=1}^{t-1}\frac{1}{t-1}\ell_i+R_\Pi\Big\|\sum_{t=1}^{T}\ell_t\Big\|_2\right)\sum_{t=1}^{T}\left(\sum_{i=1}^{t-1}\frac{\ell_t^\intercal V\ell_i\,\ell_{ix}}{t-1}-\sum_{i=1}^{t-1}\frac{\ell_t^\intercal V\ell_i\left(\sum_{s=1}^{t-1}\ell_{sx}\right)}{(t-1)^2}\right)\Bigg]\tag{17}\\
&=0,\nonumber
\end{align}

where we used the fact that each $\ell_i$ is drawn from a symmetric distribution, so flipping the sign of the variables to $-\ell_i$ yields the same distribution, which leads to the following:

\begin{align*}
&\mathbb{E}\Bigg[\left(\sum_{t=1}^{T}\ell_t^\intercal V\sum_{i=1}^{t-1}\frac{1}{t-1}\ell_i+R_\Pi\Big\|\sum_{t=1}^{T}\ell_t\Big\|_2\right)\sum_{t=1}^{T}\left(\sum_{i=1}^{t-1}\frac{\ell_t^\intercal V\ell_i\,\ell_{ix}}{t-1}-\sum_{i=1}^{t-1}\frac{\ell_t^\intercal V\ell_i\left(\sum_{s=1}^{t-1}\ell_{sx}\right)}{(t-1)^2}\right)\Bigg]\\
&\quad=\mathbb{E}\Bigg[\left(\sum_{t=1}^{T}\ell_t^\intercal V\sum_{i=1}^{t-1}\frac{1}{t-1}\ell_i+R_\Pi\Big\|\sum_{t=1}^{T}\ell_t\Big\|_2\right)\sum_{t=1}^{T}\left(-\sum_{i=1}^{t-1}\frac{\ell_t^\intercal V\ell_i\,\ell_{ix}}{t-1}+\sum_{i=1}^{t-1}\frac{\ell_t^\intercal V\ell_i\left(\sum_{s=1}^{t-1}\ell_{sx}\right)}{(t-1)^2}\right)\Bigg].
\end{align*}

Since the expression thus equals its own negation, Equation (17) equals $0$.
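The two facts used in Step 1 can be verified numerically (a sketch with our own helper names `attn_term` and `closed_form_dax` and our own parameter choices, not code from the paper): first, the closed-form derivative at $a=\boldsymbol{0}_d$, $v_c=\boldsymbol{0}_d$, $b_t=\beta\boldsymbol{1}_t$ matches a finite-difference estimate; second, the integrand of Equation (17) is odd under $\ell\mapsto-\ell$, so paired samples cancel exactly:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attn_term(ells, t, V, a, vc, beta):
    """ell_t^T (V ell_{1:t-1} + vc 1^T) Softmax(ell_{1:t-1}^T a + beta 1)."""
    L = ells[:t - 1].T                       # d x (t-1), columns ell_1..ell_{t-1}
    p = softmax(L.T @ a + beta)
    return ells[t - 1] @ ((V @ L + vc[:, None]) @ p)

def closed_form_dax(ells, t, V, x):
    """The derivative formula derived above, at a = 0, vc = 0, b = beta*1."""
    c = ells[:t - 1] @ V.T @ ells[t - 1]     # c_i = ell_t^T V ell_i
    lx = ells[:t - 1, x]
    return c @ lx / (t - 1) - c.sum() * lx.sum() / (t - 1) ** 2

rng = np.random.default_rng(3)
d, t, beta, x, h = 4, 6, 0.7, 2, 1e-6
ells = rng.normal(size=(t, d))
V = rng.normal(size=(d, d))
a, vc = np.zeros(d), np.zeros(d)

# 1) Central finite-difference check of the closed-form derivative in a_x.
ap, am = a.copy(), a.copy()
ap[x] += h; am[x] -= h
fd = (attn_term(ells, t, V, ap, vc, beta)
      - attn_term(ells, t, V, am, vc, beta)) / (2 * h)
assert abs(closed_form_dax(ells, t, V, x) - fd) < 1e-5

# 2) Sign-flip symmetry: the loss factor is even in ell while the derivative
#    factor is odd, so the integrand of Equation (17) cancels for (ell, -ell).
def integrand(ells, V, R_Pi, x):
    T = ells.shape[0]
    loss = sum(ells[s] @ V @ ells[:s].mean(axis=0) for s in range(1, T))
    loss += R_Pi * np.linalg.norm(ells.sum(axis=0))
    deriv = sum(closed_form_dax(ells, s + 1, V, x) for s in range(1, T))
    return loss * deriv

assert abs(integrand(ells, V, 0.5, x) + integrand(-ells, V, 0.5, x)) < 1e-10
```

The cancellation in part (2) is exact in floating point, since negating the inputs flips the sign of the odd factor term by term.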

Step 2. Calculating $\frac{\partial f}{\partial v_c}$.

We will use the following equation for $t\geq 2$:

\begin{align*}
&\frac{\partial}{\partial v_c}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)\texttt{Softmax}(\ell_{1:t-1}^\intercal a+b_{t-1})\\
&=\frac{\partial}{\partial v_c}\sum_{i=1}^{t-1}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)e_i\frac{\exp(e_i^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1}))}{\sum_{s=1}^{t-1}\exp(e_s^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1}))}=\ell_t.
\end{align*}

For $t=1$, we define $\frac{\partial}{\partial v_c}\ell_1^\intercal(V\ell_{1:0}+v_c\boldsymbol{1}_0^\intercal)\texttt{Softmax}(\ell_{1:0}^\intercal a+b_0)=\ell_1$, so that we can use the same formula as for $t\geq 2$. Therefore, we can calculate $\frac{\partial f}{\partial v_c}\big|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}$ as follows:

\begin{align*}
&\frac{\partial f(V,a,(b_t)_{t\in[T-1]},v_c)}{\partial v_c}\bigg|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}\\
&=\mathbb{E}\,\frac{\partial}{\partial v_c}\left(\sum_{t=1}^{T}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)\texttt{Softmax}(\ell_{1:t-1}^\intercal a+b_{t-1})+R_\Pi\Big\|\sum_{t=1}^{T}\ell_t\Big\|_2\right)^2\bigg|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}\\
&=\mathbb{E}\Bigg[\left(\sum_{t=1}^{T}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)\texttt{Softmax}(\ell_{1:t-1}^\intercal a+b_{t-1})+R_\Pi\Big\|\sum_{t=1}^{T}\ell_t\Big\|_2\right)\bigg|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}\\
&\qquad\cdot\frac{\partial}{\partial v_c}\left(\sum_{t=1}^{T}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)\texttt{Softmax}(\ell_{1:t-1}^\intercal a+b_{t-1})+R_\Pi\Big\|\sum_{t=1}^{T}\ell_t\Big\|_2\right)\bigg|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}\Bigg]\\
&=\mathbb{E}\Bigg[\left(\sum_{t=2}^{T}\ell_t^\intercal V\sum_{i=1}^{t-1}\frac{1}{t-1}\ell_i+R_\Pi\Big\|\sum_{t=1}^{T}\ell_t\Big\|_2\right)\sum_{t=1}^{T}\ell_t\Bigg]=0.
\end{align*}

The last equality holds for the same reason as in the last part of Step 1.
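Step 2's computation reduces to the observation that the softmax weights sum to one, so each summand is linear in $v_c$ with gradient $\ell_t$ (at any $a$, not only at $a=\boldsymbol{0}_d$). A finite-difference sketch confirming this (our own illustration; `attn_term` is a hypothetical helper, not the paper's code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attn_term(ells, t, V, a, vc, beta):
    """ell_t^T (V ell_{1:t-1} + vc 1^T) Softmax(ell_{1:t-1}^T a + beta 1)."""
    L = ells[:t - 1].T
    p = softmax(L.T @ a + beta)
    return ells[t - 1] @ ((V @ L + vc[:, None]) @ p)

rng = np.random.default_rng(5)
d, t, beta, h = 4, 6, 0.3, 1e-6
ells = rng.normal(size=(t, d))
V = rng.normal(size=(d, d))
a, vc = rng.normal(size=d), np.zeros(d)   # the identity holds for arbitrary a

# d/dv_c of the t-th summand is ell_t, because the softmax weights sum to one.
max_err = 0.0
for x in range(d):
    vp, vm = vc.copy(), vc.copy()
    vp[x] += h; vm[x] -= h
    fd = (attn_term(ells, t, V, a, vp, beta)
          - attn_term(ells, t, V, a, vm, beta)) / (2 * h)
    max_err = max(max_err, abs(fd - ells[t - 1][x]))
assert max_err < 1e-5
```

Since the summand is exactly linear in $v_c$, the finite difference agrees with $\ell_t$ up to floating-point error.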

Step 3. Calculating $\frac{\partial f}{\partial V}$.

We calculate the following equation for $t\geq 2$, which will be used to compute $\frac{\partial f}{\partial V}\big|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}$:

\begin{align*}
&\frac{\partial}{\partial V}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)\texttt{Softmax}(\ell_{1:t-1}^\intercal a+b_{t-1})\bigg|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}\\
&=\frac{\partial}{\partial V}\sum_{i=1}^{t-1}\ell_t^\intercal(V\ell_{1:t-1}+v_c\boldsymbol{1}_{t-1}^\intercal)e_i\frac{\exp(e_i^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1}))}{\sum_{s=1}^{t-1}\exp(e_s^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1}))}\bigg|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}\\
&=\sum_{i=1}^{t-1}\ell_t\ell_i^\intercal\frac{\exp(e_i^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1}))}{\sum_{s=1}^{t-1}\exp(e_s^\intercal(\ell_{1:t-1}^\intercal a+b_{t-1}))}\bigg|_{a=\boldsymbol{0}_d,\,v_c=\boldsymbol{0}_d,\,(b_t=\beta\boldsymbol{1}_t)_{t\in[T-1]}}=\sum_{i=1}^{t-1}\frac{1}{t-1}\ell_t\ell_i^\intercal.
\end{align*}

For $t=1$, note that $\frac{\partial}{\partial V}\ell_t^\intercal v_c=\boldsymbol{O}_{d\times d}$, so we can use the same formula as for $t\geq 2$ with the empty sum $\sum_{i=1}^{t-1}$.
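The per-term derivative with respect to $V$ can likewise be checked by finite differences (a sketch under our own parameter choices; since the term is linear in $V$, the match is exact up to floating-point error):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attn_term(ells, t, V, beta):
    """ell_t^T V ell_{1:t-1} Softmax(beta 1), i.e. the summand at a = 0, vc = 0."""
    L = ells[:t - 1].T
    p = softmax(np.full(t - 1, beta))     # uniform weights 1/(t-1)
    return ells[t - 1] @ (V @ L @ p)

rng = np.random.default_rng(6)
d, t, beta, h = 3, 5, 0.2, 1e-6
ells = rng.normal(size=(t, d))
V = rng.normal(size=(d, d))

# Closed form: (1/(t-1)) * sum_i ell_t ell_i^T = outer(ell_t, mean of ell_i).
closed = ells[t - 1][:, None] @ ells[:t - 1].mean(axis=0)[None, :]

fd = np.zeros((d, d))
for r in range(d):
    for c in range(d):
        Vp, Vm = V.copy(), V.copy()
        Vp[r, c] += h; Vm[r, c] -= h
        fd[r, c] = (attn_term(ells, t, Vp, beta)
                    - attn_term(ells, t, Vm, beta)) / (2 * h)
assert np.max(np.abs(fd - closed)) < 1e-5
```

Note that the constant logits $\beta\boldsymbol{1}_{t-1}$ make the softmax uniform, which is exactly why the $1/(t-1)$ averaging appears in the closed form.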

Therefore, we have

\begin{align*}
&\frac{\partial f(V,a,(b_{t})_{t\in[T-1]},v_{c})}{\partial V}\bigg|_{a=\boldsymbol{0}_{d},\,v_{c}=\boldsymbol{0}_{d},\,(b_{t}=\beta\boldsymbol{1}_{t})_{t\in[T-1]}}\\
&=\mathbb{E}\,\frac{\partial}{\partial V}\left(\sum_{t=1}^{T}\ell_{t}^{\intercal}(V\ell_{1:t-1}+v_{c}\boldsymbol{1}_{t-1}^{\intercal})\texttt{Softmax}(\ell_{1:t-1}^{\intercal}a+b_{t-1})+R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right)^{2}\bigg|_{a=\boldsymbol{0}_{d},\,v_{c}=\boldsymbol{0}_{d},\,(b_{t}=\beta\boldsymbol{1}_{t})_{t\in[T-1]}}\\
&=\mathbb{E}\Bigg[\left(\sum_{t=1}^{T}\ell_{t}^{\intercal}(V\ell_{1:t-1}+v_{c}\boldsymbol{1}_{t-1}^{\intercal})\texttt{Softmax}(\ell_{1:t-1}^{\intercal}a+b_{t-1})+R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right)\bigg|_{a=\boldsymbol{0}_{d},\,v_{c}=\boldsymbol{0}_{d},\,(b_{t}=\beta\boldsymbol{1}_{t})_{t\in[T-1]}}\\
&\qquad\qquad\cdot\frac{\partial}{\partial V}\left(\sum_{t=1}^{T}\ell_{t}^{\intercal}(V\ell_{1:t-1}+v_{c}\boldsymbol{1}_{t-1}^{\intercal})\texttt{Softmax}(\ell_{1:t-1}^{\intercal}a+b_{t-1})+R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right)\bigg|_{a=\boldsymbol{0}_{d},\,v_{c}=\boldsymbol{0}_{d},\,(b_{t}=\beta\boldsymbol{1}_{t})_{t\in[T-1]}}\Bigg]\\
&=\mathbb{E}\Bigg[\left(\sum_{t=1}^{T}\ell_{t}^{\intercal}V\sum_{i=1}^{t-1}\frac{1}{t-1}\ell_{i}+R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right)\sum_{t=1}^{T}\sum_{i=1}^{t-1}\frac{1}{t-1}\ell_{t}\ell_{i}^{\intercal}\Bigg]\\
&=\mathbb{E}\Bigg[\sum_{t=1}^{T}\sum_{i=1}^{t-1}\left(\frac{1}{t-1}\ell_{t}^{\intercal}V\ell_{i}\right)\left(\frac{1}{t-1}\ell_{t}\ell_{i}^{\intercal}\right)+R_{\Pi}T\Big\|\sum_{t^{\prime}=1}^{T}\ell_{t^{\prime}}\Big\|_{2}\ell_{t}\ell_{i}^{\intercal}\Bigg]\\
&=\mathbb{E}\Bigg[\sum_{t=1}^{T}\sum_{i=1}^{t-1}\sum_{x=1}^{d}\sum_{y=1}^{d}v_{xy}\ell_{tx}\ell_{iy}\left(\frac{1}{t-1}\right)^{2}[\ell_{tz}\ell_{iw}]_{(z,w)}+R_{\Pi}T\Big\|\sum_{t^{\prime}=1}^{T}\ell_{t^{\prime}}\Big\|_{2}\ell_{t}\ell_{i}^{\intercal}\Bigg]\\
&=\sum_{t=1}^{T}\sum_{i=1}^{t-1}\sum_{x=1}^{d}\sum_{y=1}^{d}\frac{1}{(t-1)^{2}}[\sigma_{xz}v_{xy}\sigma_{yw}]_{(z,w)}+\mathbb{E}\Bigg[R_{\Pi}T\Big\|\sum_{t^{\prime}=1}^{T}\ell_{t^{\prime}}\Big\|_{2}\ell_{t}\ell_{i}^{\intercal}\Bigg]\\
&=\left(\sum_{t=1}^{T-1}\frac{1}{t}\right)\Sigma V\Sigma+\mathbb{E}\Bigg[R_{\Pi}T\Big\|\sum_{t^{\prime}=1}^{T}\ell_{t^{\prime}}\Big\|_{2}\ell_{t}\ell_{i}^{\intercal}\Bigg].
\end{align*}

Therefore, if $V^{\star}=-R_{\Pi}\frac{T}{\sum_{t=1}^{T-1}1/t}\Sigma^{-1}\mathbb{E}\Big[\big\|\sum_{t=1}^{T}\ell_{t}\big\|_{2}\ell_{t}\ell_{i}^{\intercal}\Big]\Sigma^{-1}$, then $\frac{\partial f}{\partial V}\big|_{a=\boldsymbol{0}_{d},\,v_{c}=\boldsymbol{0}_{d},\,(b_{t}=\beta\boldsymbol{1}_{t})_{t\in[T-1]},\,V=V^{\star}}=\boldsymbol{O}_{d\times d}$. Lastly, we have

\begin{align*}
\frac{\partial f}{\partial K}\bigg|_{K^{\intercal}(Qc+q_{c})=v_{c}=\boldsymbol{0}_{d},\,V=V^{\star}}&=\left(\frac{\partial f}{\partial a}\frac{\partial a}{\partial K}\right)\bigg|_{a=\boldsymbol{0}_{d},\,v_{c}=\boldsymbol{0}_{d},\,(b_{t}=\beta\boldsymbol{1}_{t})_{t\in[T-1]},\,V=V^{\star}}=\boldsymbol{O}_{d\times d},\\
\frac{\partial f}{\partial Q}\bigg|_{K^{\intercal}(Qc+q_{c})=v_{c}=\boldsymbol{0}_{d},\,V=V^{\star}}&=\left(\frac{\partial f}{\partial a}\frac{\partial a}{\partial Q}\right)\bigg|_{a=\boldsymbol{0}_{d},\,v_{c}=\boldsymbol{0}_{d},\,(b_{t}=\beta\boldsymbol{1}_{t})_{t\in[T-1]},\,V=V^{\star}}=\boldsymbol{O}_{d\times d},\\
\frac{\partial f}{\partial q_{c}}\bigg|_{K^{\intercal}(Qc+q_{c})=v_{c}=\boldsymbol{0}_{d},\,V=V^{\star}}&=\left(\frac{\partial f}{\partial a}\frac{\partial a}{\partial q_{c}}\right)\bigg|_{a=\boldsymbol{0}_{d},\,v_{c}=\boldsymbol{0}_{d},\,(b_{t}=\beta\boldsymbol{1}_{t})_{t\in[T-1]},\,V=V^{\star}}=\boldsymbol{0}_{d},
\end{align*}

which means that such configurations are first-order stationary points of Equation 3 with $N=1$, $h(x)=x^{2}$, and $\Pi=B(\boldsymbol{0}_{d},R_{\Pi},\|\cdot\|)$. ∎
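Two closed-form steps in the proof above can be checked numerically (a minimal sketch with numpy; the matrix `M` is a random stand-in for the expectation term $\mathbb{E}[R_{\Pi}T\|\sum_{t'}\ell_{t'}\|_{2}\ell_{t}\ell_{i}^{\intercal}]$, and the dimensions are arbitrary): the coefficient $\sum_{t=2}^{T}(t-1)\cdot(1/(t-1))^{2}$ telescopes to $\sum_{t=1}^{T-1}1/t$, and solving $\big(\sum_{t=1}^{T-1}1/t\big)\Sigma V\Sigma+M=\boldsymbol{O}_{d\times d}$ for $V$ zeroes the gradient in $V$:

```python
import numpy as np

# Coefficient check: sum_{t=2}^{T} (t-1) * (1/(t-1))^2 = sum_{t=1}^{T-1} 1/t.
T, d = 8, 3
coef = sum((t - 1) * (1.0 / (t - 1)) ** 2 for t in range(2, T + 1))
assert abs(coef - sum(1.0 / t for t in range(1, T))) < 1e-12

# Stationarity in V: solving coef * Sigma V Sigma + M = O gives
# V = -(1/coef) Sigma^{-1} M Sigma^{-1}.
rng = np.random.default_rng(0)
B = rng.standard_normal((d, d))
Sigma = B @ B.T + d * np.eye(d)   # generic positive-definite covariance
M = rng.standard_normal((d, d))   # random stand-in for the expectation term
Sigma_inv = np.linalg.inv(Sigma)
V_star = -(1.0 / coef) * Sigma_inv @ M @ Sigma_inv
assert np.allclose(coef * Sigma @ V_star @ Sigma + M, 0.0)
```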

D.5 Deferred Proof of Theorem 4

See Theorem 4.

Proof.

The output of the single-layer linear self-attention structure is as follows:

\begin{align}
g(Z_{t};V,K,Q,v_{c},k_{c},q_{c})=\sum_{i=1}^{t}\Big(V\ell_{i}\ell_{i}^{\intercal}\big(K^{\intercal}(Qc+q_{c})\big)+\big(Vk_{c}^{\intercal}(Qc+q_{c})+v_{c}(Qc+q_{c})^{\intercal}K\big)\ell_{i}+v_{c}k_{c}^{\intercal}(Qc+q_{c})\Big),\tag{18}
\end{align}

which can be expressed within the larger class

\begin{align}
g(Z_{t};\mathbb{A},\beta,\mathbb{C},\delta):=\sum_{i=1}^{t}\big(\mathbb{A}\ell_{i}\ell_{i}^{\intercal}\beta+\mathbb{C}\ell_{i}+\delta\big),\tag{19}
\end{align}

where $\mathbb{A},\mathbb{C}\in\mathbb{R}^{d\times d}$ and $\beta,\delta\in\mathbb{R}^{d}$. Then, if a minimizer of

\begin{align*}
f(\mathbb{A},\beta,\mathbb{C},\delta):=\mathbb{E}\left(\sum_{t=1}^{T}\Big\langle\ell_{t},\sum_{i=1}^{t-1}\big(\mathbb{A}\ell_{i}\ell_{i}^{\intercal}\beta+\mathbb{C}\ell_{i}+\delta\big)\Big\rangle-\inf_{\pi\in\Pi}\Big\langle\sum_{t=1}^{T}\ell_{t},\pi\Big\rangle\right)^{2}
\end{align*}

can be expressed as $\mathbb{A}=V$, $\beta=K^{\intercal}(Qc+q_{c})$, $\mathbb{C}=Vk_{c}^{\intercal}(Qc+q_{c})+v_{c}(Qc+q_{c})^{\intercal}K$, and $\delta=v_{c}k_{c}^{\intercal}(Qc+q_{c})$, then we can conclude that the corresponding $V,Q,K,v_{c},q_{c},k_{c}$ are also a minimizer of

\begin{align*}
\mathbb{E}\left(\sum_{t=1}^{T}\big\langle\ell_{t},g(Z_{t-1})\big\rangle-\inf_{\pi\in\Pi}\Big\langle\sum_{t=1}^{T}\ell_{t},\pi\Big\rangle\right)^{2},
\end{align*}

since the corresponding $V,Q,K,v_{c},q_{c},k_{c}$ constitute a minimizer within a larger class. Now, since $\Pi=B(\boldsymbol{0}_{d},R_{\Pi},\|\cdot\|)$, we can rewrite $f$ as

\begin{align}
f(\mathbb{A},\beta,\mathbb{C},\delta)=\mathbb{E}\left(\sum_{t=1}^{T}\Big\langle\ell_{t},\sum_{i=1}^{t-1}\big(\mathbb{A}\ell_{i}\ell_{i}^{\intercal}\beta+\mathbb{C}\ell_{i}+\delta\big)\Big\rangle+R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right)^{2}.\tag{20}
\end{align}

Step 1. Finding the condition for $\frac{\partial f}{\partial\delta}=0$.

By the Leibniz rule (differentiation under the expectation), taking the partial derivative of Equation 20 w.r.t. $\delta$ yields

\begin{align}
\frac{\partial f(\mathbb{A},\beta,\mathbb{C},\delta)}{\partial\delta}&=\frac{\partial}{\partial\delta}\mathbb{E}\left(\sum_{t=1}^{T}\Big\langle\ell_{t},\sum_{i=1}^{t-1}\big(\mathbb{A}\ell_{i}\ell_{i}^{\intercal}\beta+\mathbb{C}\ell_{i}+\delta\big)\Big\rangle+R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right)^{2}\notag\\
&=\mathbb{E}\,\frac{\partial}{\partial\delta}\left(\sum_{t=1}^{T}\Big\langle\ell_{t},\sum_{i=1}^{t-1}\big(\mathbb{A}\ell_{i}\ell_{i}^{\intercal}\beta+\mathbb{C}\ell_{i}+\delta\big)\Big\rangle+R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right)^{2}\notag\\
&=\mathbb{E}\sum_{t_{1}=1}^{T}(t_{1}-1)\ell_{t_{1}}\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\big(\mathbb{A}\ell_{i}\ell_{i}^{\intercal}\beta+\mathbb{C}\ell_{i}+\delta\big)+R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right),\tag{21}
\end{align}
where in the last step we use $\frac{\partial}{\partial\delta}\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\delta=\sum_{t_{1}=1}^{T}(t_{1}-1)\ell_{t_{1}}$ and drop the constant factor of $2$ from the chain rule, since it does not affect the first-order condition.

Since $\ell_{t}$ follows a symmetric distribution, the expectation of any odd-degree polynomial of the losses, possibly multiplied by the (even) factor $\|\cdot\|_{2}$, is $0$. Hence, we have

\begin{align*}
\mathbb{E}\sum_{t_{1}=1}^{T}(t_{1}-1)\ell_{t_{1}}R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}=0,\qquad\mathbb{E}\sum_{t_{1}=1}^{T}(t_{1}-1)\ell_{t_{1}}\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\mathbb{C}\ell_{i}=0.
\end{align*}
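This vanishing can be verified exactly for a small symmetric distribution by enumerating all outcomes (a minimal sketch; taking $\ell_{t}$ uniform on $\{\pm e_{1},\pm e_{2}\}$ and $T=2$ is an arbitrary illustrative choice):

```python
import itertools
import math

# l_t uniform on {+-e_1, +-e_2} is symmetric with mean zero (illustrative choice).
# Enumerate all outcomes for T = 2 and check E[ sum_t (t-1) l_t * ||sum_t l_t||_2 ] = 0.
support = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
T = 2
total = [0.0, 0.0]
for ls in itertools.product(support, repeat=T):
    s = [sum(l[k] for l in ls) for k in range(2)]
    nrm = math.sqrt(s[0] ** 2 + s[1] ** 2)
    for k in range(2):
        total[k] += sum((t - 1) * ls[t - 1][k] for t in range(1, T + 1)) * nrm
total = [x / len(support) ** T for x in total]
assert all(abs(x) < 1e-12 for x in total)
```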

Now, we calculate

\begin{align*}
\mathbb{E}\sum_{t_{1}=1}^{T}(t_{1}-1)\ell_{t_{1}}\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\mathbb{A}\ell_{i}\ell_{i}^{\intercal}\beta&=\mathbb{E}\sum_{t_{1}=1}^{T}\sum_{t=1}^{T}\sum_{i=1}^{t-1}(t_{1}-1)\ell_{t_{1}}\ell_{t}^{\intercal}\mathbb{A}\ell_{i}\ell_{i}^{\intercal}\beta\\
&\overset{(i)}{=}\mathbb{E}\sum_{t=1}^{T}\sum_{i=1}^{t-1}(t-1)\ell_{t}\ell_{t}^{\intercal}\mathbb{A}\ell_{i}\ell_{i}^{\intercal}\beta=\mathbb{E}\sum_{t=1}^{T}(t-1)^{2}\ell_{t}\ell_{t}^{\intercal}\mathbb{A}\Sigma\beta=\frac{1}{6}T(2T^{2}-3T+1)\Sigma\mathbb{A}\Sigma\beta,
\end{align*}

where $(i)$ holds since for $t_{1}\neq t$, $\ell_{t_{1}}$ and $\ell_{t}$ are independent and $\mathbb{E}\ell_{t}=0$. Lastly,

\begin{align*}
\mathbb{E}\sum_{t_{1}=1}^{T}(t_{1}-1)\ell_{t_{1}}\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\delta=\mathbb{E}\sum_{t_{1}=1}^{T}\sum_{t=1}^{T}(t_{1}-1)(t-1)\ell_{t_{1}}\ell_{t}^{\intercal}\delta=\frac{1}{6}T(2T^{2}-3T+1)\Sigma\delta.
\end{align*}

Plugging the above equations into Equation 21, we have

\begin{align*}
\frac{\partial f(\mathbb{A},\beta,\mathbb{C},\delta)}{\partial\delta}=\frac{1}{6}T(2T^{2}-3T+1)(\Sigma\mathbb{A}\Sigma\beta+\Sigma\delta).
\end{align*}

Setting this derivative to zero and using the invertibility of $\Sigma$, the first-order optimality condition gives

\begin{align}
\mathbb{A}\Sigma\beta+\delta=0.\tag{22}
\end{align}
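Both ingredients of this step can be checked numerically (a minimal sketch with numpy; $\Sigma$, $\mathbb{A}$, and $\beta$ are random test values): the closed form $\sum_{t=1}^{T}(t-1)^{2}=\frac{1}{6}T(2T^{2}-3T+1)$, and the fact that $\delta=-\mathbb{A}\Sigma\beta$ zeroes $\Sigma\mathbb{A}\Sigma\beta+\Sigma\delta$:

```python
import numpy as np

# Closed form used above: sum_{t=1}^{T} (t-1)^2 = T(2T^2 - 3T + 1)/6.
for T in range(1, 30):
    assert sum((t - 1) ** 2 for t in range(1, T + 1)) == T * (2 * T**2 - 3 * T + 1) // 6

# First-order condition: delta = -A Sigma beta makes Sigma A Sigma beta + Sigma delta vanish.
rng = np.random.default_rng(1)
d = 4
B = rng.standard_normal((d, d))
Sigma = B @ B.T + np.eye(d)       # generic positive-definite covariance
A = rng.standard_normal((d, d))   # plays the role of the matrix A in the text
beta = rng.standard_normal(d)
delta = -A @ Sigma @ beta
assert np.allclose(Sigma @ A @ Sigma @ beta + Sigma @ delta, 0.0)
```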

Step 2. Plugging the optimality condition for $\frac{\partial f}{\partial\delta}$ into Equation 20.

Plugging Equation 22 into Equation 20, $f$ can be written as

\begin{align*}
f(\mathbb{A},\beta,\mathbb{C},-\mathbb{A}\Sigma\beta)&=\mathbb{E}\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\big(\mathbb{A}(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\beta+\mathbb{C}\ell_{i}\big)+R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right)^{2}\\
&=\underbrace{\mathbb{E}\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\mathbb{A}(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\beta\right)^{2}}_{(i)}+\mathbb{E}\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\mathbb{C}\ell_{i}\right)^{2}+\mathbb{E}\left(R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right)^{2}\\
&\quad+\underbrace{2\mathbb{E}\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\mathbb{A}(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\beta\right)\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\mathbb{C}\ell_{i}\right)}_{(ii)}\\
&\quad+\underbrace{2\mathbb{E}\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\mathbb{A}(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\beta\right)\left(R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right)}_{(iii)}\\
&\quad+2\mathbb{E}\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\mathbb{C}\ell_{i}\right)\left(R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right).
\end{align*}

For the part $(i)$, we have

\begin{align}
\mathbb{E}\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\mathbb{A}(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\beta\right)^{2}&=\mathbb{E}\left[\sum_{t_{1}=1}^{T}\sum_{i_{1}=1}^{t_{1}-1}\sum_{t=1}^{T}\sum_{i=1}^{t-1}\beta^{\intercal}(\ell_{i_{1}}\ell_{i_{1}}^{\intercal}-\Sigma)\mathbb{A}^{\intercal}\ell_{t_{1}}\ell_{t}^{\intercal}\mathbb{A}(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\beta\right]\notag\\
&\overset{(1)}{=}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i_{1}=1}^{t-1}\sum_{i=1}^{t-1}\beta^{\intercal}(\ell_{i_{1}}\ell_{i_{1}}^{\intercal}-\Sigma)\mathbb{A}^{\intercal}\ell_{t}\ell_{t}^{\intercal}\mathbb{A}(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\beta\right]\notag\\
&\overset{(2)}{=}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{t-1}\beta^{\intercal}(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\mathbb{A}^{\intercal}\ell_{t}\ell_{t}^{\intercal}\mathbb{A}(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\beta\right]\notag\\
&=\frac{(T-1)T}{2}\beta^{\intercal}\mathbb{E}\left[(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\mathbb{A}^{\intercal}\Sigma\mathbb{A}(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\right]\beta\tag{23}\\
&=\frac{(T-1)T}{2}\beta^{\intercal}\mathbb{E}\left[\big(\sqrt{\Sigma}\mathbb{A}(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\big)^{\intercal}\big(\sqrt{\Sigma}\mathbb{A}(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\big)\right]\beta.\notag
\end{align}

Here, $(1)$ holds because for $t_{1}\neq t$, $\mathbb{E}\ell_{t_{1}}=\mathbb{E}\ell_{t}=0$ and they are independent of the remaining terms, and $(2)$ holds because for $i_{1}\neq i$, $\mathbb{E}(\ell_{i_{1}}\ell_{i_{1}}^{\intercal}-\Sigma)=\boldsymbol{O}_{d\times d}$. In addition, we can easily check that $(ii)$ and $(iii)$ are $0$, as they are expectations of odd-degree polynomials (possibly multiplied by the even factor $\|\cdot\|_{2}$) and $Z\overset{d}{=}-Z$. Note that Equation 23 is non-negative, and is minimized (with value $0$) if and only if $\mathbb{P}\big(\sqrt{\Sigma}\mathbb{A}(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\beta=\boldsymbol{0}_{d}\big)=1$.

If $\mathbb{A}\neq\boldsymbol{O}_{d\times d}$, consider the singular value decomposition $\mathbb{A}=U\Lambda V$, where $\Lambda$ is a diagonal matrix whose first diagonal element is non-zero, and $U,V$ are orthogonal matrices. We then look for $\beta$ such that $\sqrt{\Sigma}U\Lambda V(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\beta=\boldsymbol{0}_{d}$ for every $\ell_{i}$ with $p(\ell_{i})\neq 0$, where $p$ denotes the probability density function of the loss vectors. Since $\sqrt{\Sigma}$ and $U$ are invertible, it suffices to consider $\Lambda V(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\beta=\boldsymbol{0}_{d}$. Since $\Lambda$'s first diagonal component is non-zero, we consider the first coordinate, $e_{1}^{\intercal}\Lambda V(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\beta=0$, which is equivalent to $V_{1}(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\beta=0$, where $V_{1}$, the first row of $V$, is a non-zero vector.

Now, consider more generally the map $a_{x,y}(v):=vv^{\intercal}x-y$, where $x,y\in\mathbb{R}^{d}$ and $a_{x,y}:B(\boldsymbol{0}_{d},2\epsilon_{1},\|\cdot\|)\to\mathbb{R}^{d}$. One can check that the Jacobian of $a_{x,y}(v)$ is $vx^{\intercal}+(v^{\intercal}x)I$, and that its determinant is non-zero at $v=\epsilon_{1}x$ whenever $x\neq\boldsymbol{0}_{d}$. Therefore, the volume of $\{V_{1}(\ell_{i}\ell_{i}^{\intercal}-\Sigma):\ell_{i}\in B(\boldsymbol{0}_{d},c_{z},\|\cdot\|)\}$ is no smaller than the volume of $\{V_{1}(vv^{\intercal}-\Sigma):v\in B(\epsilon_{1}V_{1}^{\intercal},\epsilon_{2},\|\cdot\|)\}$, where $c_{z}$ is a constant such that $B(\boldsymbol{0}_{d},c_{z},\|\cdot\|)\subseteq\text{supp}(Z)$, and $\epsilon_{1},\epsilon_{2}>0$ satisfy $\epsilon_{1}\|V_{1}\|+\epsilon_{2}<c_{z}$.
Here, we choose $\epsilon_{2}>0$ sufficiently small so that the determinant of the Jacobian of $v\mapsto vv^{\intercal}V_{1}^{\intercal}-\Sigma V_{1}^{\intercal}$ is positive for all $v\in B(\epsilon_{1}V_{1}^{\intercal},\epsilon_{2},\|\cdot\|)$, so that this map is a one-to-one correspondence on this ball by the inverse function theorem. Therefore, the volume of $\{V_{1}(vv^{\intercal}-\Sigma):v\in B(\epsilon_{1}V_{1}^{\intercal},\epsilon_{2},\|\cdot\|)\}$ can be calculated as

\begin{align*}
\left[\text{Volume of }\{V_{1}(vv^{\intercal}-\Sigma):v\in B(\epsilon_{1}V_{1}^{\intercal},\epsilon_{2},\|\cdot\|)\}\right]=\int_{v\in B(\epsilon_{1}V_{1}^{\intercal},\epsilon_{2},\|\cdot\|)}\big|\det\big(\text{Jacobian}(V_{1}(vv^{\intercal}-\Sigma))\big)\big|\,dv>0.
\end{align*}
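The Jacobian formula used in this argument can be verified by central finite differences (a minimal sketch with numpy; $x$, $y$, and $v$ are random test vectors):

```python
import numpy as np

# Jacobian of a(v) = (v v^T) x - y with respect to v is v x^T + (v^T x) I;
# verify against central finite differences at random test vectors x, y, v.
rng = np.random.default_rng(2)
d = 3
x, y, v = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)

def a(vv):
    return np.outer(vv, vv) @ x - y

J_analytic = np.outer(v, x) + (v @ x) * np.eye(d)
eps = 1e-6
J_fd = np.column_stack([(a(v + eps * e) - a(v - eps * e)) / (2 * eps) for e in np.eye(d)])
assert np.allclose(J_analytic, J_fd, atol=1e-5)
```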

Therefore, the volume of $\{V_{1}(vv^{\intercal}-\Sigma):v\in B(\epsilon_{1}V_{1}^{\intercal},\epsilon_{2},\|\cdot\|)\}$ is non-zero, so we can find $d$ loss vectors $\{\ell_{i}\}_{i\in[d]}$ in the support such that the vectors $\{V_{1}(\ell_{i}\ell_{i}^{\intercal}-\Sigma)\}_{i\in[d]}$ are linearly independent, which forces $\beta=\boldsymbol{0}_{d}$. Hence, to minimize Equation 23, either $\mathbb{A}=\boldsymbol{O}_{d\times d}$ or $\beta=\boldsymbol{0}_{d}$ must hold. In both cases (together with Equation 22, which then gives $\delta=-\mathbb{A}\Sigma\beta=\boldsymbol{0}_{d}$), Equation 19 can be re-written as

\begin{align*}
g(Z_{t};\mathbb{A},\beta,\mathbb{C},\delta)=\sum_{i=1}^{t}\mathbb{C}\ell_{i},
\end{align*}

and this is covered by the original parametrization (Equation 18) with $K^{\intercal}(Qc+q_{c})=v_{c}=\boldsymbol{0}_{d}$.

Step 3. Calculating $\frac{\partial f}{\partial\mathbb{C}}$.

Now, we optimize over $\mathbb{C}$ by minimizing the following objective:

\begin{align*}
f(\mathbb{C})&=\mathbb{E}\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\mathbb{C}\ell_{i}+R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right)^{2}\\
&=\underbrace{\mathbb{E}\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\mathbb{C}\ell_{i}\right)^{2}}_{(i)}+2\mathbb{E}\left(\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\mathbb{C}\ell_{i}\right)R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right)+\mathbb{E}\left(R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right)^{2}\\
&=\frac{T(T-1)}{2}\mathop{\mathrm{Tr}}\left(\mathbb{C}^{\intercal}\Sigma\mathbb{C}\Sigma\right)+2R_{\Pi}\,\mathbb{E}\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\mathbb{C}\ell_{i}\Big\|\sum_{j=1}^{T}\ell_{j}\Big\|_{2}\right)+\mathbb{E}\left(R_{\Pi}\Big\|\sum_{t=1}^{T}\ell_{t}\Big\|_{2}\right)^{2}.
\end{align*}

Here, $(i)$ can be calculated as follows:

\begin{align*}
\mathbb{E}\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}^{\intercal}\mathbb{C}\ell_{i}\right)^{2}&=\mathbb{E}\left(\sum_{t_{1}=1}^{T}\sum_{i_{1}=1}^{t_{1}-1}\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{i_{1}}^{\intercal}\mathbb{C}^{\intercal}\ell_{t_{1}}\ell_{t}^{\intercal}\mathbb{C}\ell_{i}\right)\\
&\overset{(1)}{=}\mathbb{E}\left(\sum_{t=1}^{T}\sum_{i_{1}=1}^{t-1}\sum_{i=1}^{t-1}\ell_{i_{1}}^{\intercal}\mathbb{C}^{\intercal}\ell_{t}\ell_{t}^{\intercal}\mathbb{C}\ell_{i}\right)=\mathbb{E}\left(\sum_{t=1}^{T}\sum_{i_{1}=1}^{t-1}\sum_{i=1}^{t-1}\ell_{i_{1}}^{\intercal}\mathbb{C}^{\intercal}\Sigma\mathbb{C}\ell_{i}\right)\\
&\overset{(2)}{=}\mathbb{E}\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{i}^{\intercal}\mathbb{C}^{\intercal}\Sigma\mathbb{C}\ell_{i}\right)\overset{(3)}{=}\mathbb{E}\mathop{\mathrm{Tr}}\left(\sum_{t=1}^{T}\sum_{i=1}^{t-1}\mathbb{C}^{\intercal}\Sigma\mathbb{C}\ell_{i}\ell_{i}^{\intercal}\right)=\frac{T(T-1)}{2}\mathop{\mathrm{Tr}}\left(\mathbb{C}^{\intercal}\Sigma\mathbb{C}\Sigma\right),
\end{align*}

where (1) holds because if $t_{1}\neq t$, the cross term vanishes since $\mathbb{E}\ell_{t}=\mathbb{E}\ell_{t_{1}}=0$ and the losses are independent; (2) holds for a similar reason (terms with $i_{1}\neq i$ vanish since $\mathbb{E}\ell_{i_{1}}=0$); and (3) follows from $\mathop{\mathrm{Tr}}(AB)=\mathop{\mathrm{Tr}}(BA)$.

We now calculate $\frac{\partial f(\mathbb{C})}{\partial\mathbb{C}}$:

\[
\frac{\partial f(\mathbb{C})}{\partial\mathbb{C}}=T(T-1)\Sigma\mathbb{C}\Sigma+2R_{\Pi}\mathbb{E}\left(\Big\|\sum_{j=1}^{T}\ell_{j}\Big\|\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}\ell_{i}^{\intercal}\right).
\]

Hence, the optimal $\mathbb{C}$ is
\[
\mathbb{C}=-\frac{2R_{\Pi}}{T(T-1)}\Sigma^{-1}\mathbb{E}\left(\Big\|\sum_{j=1}^{T}\ell_{j}\Big\|\sum_{t=1}^{T}\sum_{i=1}^{t-1}\ell_{t}\ell_{i}^{\intercal}\right)\Sigma^{-1}.
\]

Now, for the special case of $\Sigma=I$, exchangeability of the $\ell_{t}$'s gives $\mathbb{C}=-R_{\Pi}\mathbb{E}\left(\|\sum_{j=1}^{T}\ell_{j}\|\,\ell_{i}\ell_{k}^{\intercal}\right)$ for any fixed pair $i\neq k$. To obtain the $(a,b)$-coordinate of $\mathbb{C}$, we need to calculate

\[
\mathbb{E}_{\ell}\left[\sqrt{\sum_{o=1}^{d}\Big(\sum_{s=1}^{T}\ell_{so}\Big)^{2}}\,\ell_{ia}\ell_{kb}\right].
\]

If $a\neq b$, then since the distribution of the losses is symmetric around zero, the term above is zero. Therefore, we only need to consider the case $a=b$, i.e., $\mathbb{E}_{\ell}\left[\sqrt{\sum_{o=1}^{d}(\sum_{s=1}^{T}\ell_{so})^{2}}\,\ell_{ia}\ell_{ka}\right]$, which takes the same value for every $a\in[d]$ since the coordinates of $\ell_{i}$ are independent.
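This symmetry claim can be spot-checked numerically. The sketch below (our own illustration; the small $T$ and $d$, the standard normal losses, and the seed are illustrative assumptions, not taken from the experiments in this paper) estimates the expectation by Monte Carlo for an off-diagonal pair $a\neq b$ and a diagonal pair $a=b$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n = 5, 4, 200_000  # illustrative sizes, not the paper's settings
# n independent draws of T loss vectors in R^d with i.i.d. N(0,1) coordinates
L = rng.standard_normal((n, T, d))
norm = np.linalg.norm(L.sum(axis=1), axis=1)           # ||sum_s l_s|| per draw
i, k = 0, 1                                            # two fixed rounds, i != k
off = float(np.mean(norm * L[:, i, 0] * L[:, k, 1]))   # coordinates a != b
diag = float(np.mean(norm * L[:, i, 0] * L[:, k, 0]))  # coordinates a == b
print(off, diag)
```

With this seed, the off-diagonal estimate should be within sampling noise of zero, while the diagonal estimate stays bounded away from zero, as the symmetry argument predicts.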

Now, we calculate the scale of $\mathbb{E}_{\ell}\left[\sqrt{\sum_{o=1}^{d}(\sum_{s=1}^{T}\ell_{so})^{2}}\,\ell_{i1}\ell_{k1}\right]$. By the law of large numbers, $Z:=\frac{\sum_{o=1}^{d-1}(\sum_{s=1}^{T}\ell_{so})^{2}}{T(d-1)}\overset{a.s.}{\to}1$ as $d\to\infty$, where the sum runs over the $d-1$ coordinates other than the first. We also define $W:=\sum_{s\neq i,k}\ell_{s1}/\sqrt{T}$, which is independent of $\ell_{i1}$ and $\ell_{k1}$.

\begin{align*}
&\mathbb{E}_{\ell}\left[\sqrt{\sum_{o=1}^{d}\Big(\sum_{s=1}^{T}\ell_{so}\Big)^{2}}\,\ell_{i1}\ell_{k1}\right]=\mathbb{E}_{Z,W,\ell_{i1},\ell_{k1}}\left[\sqrt{T(d-1)Z+(\sqrt{T}W+\ell_{i1}+\ell_{k1})^{2}}\,\ell_{i1}\ell_{k1}\right]\\
&\quad=\mathbb{E}_{Z,W,\ell_{i1},\ell_{k1}\geq 0}\left[\sqrt{T(d-1)Z+(\sqrt{T}W+\ell_{i1}+\ell_{k1})^{2}}\,\ell_{i1}\ell_{k1}-\sqrt{T(d-1)Z+(\sqrt{T}W+\ell_{i1}-\ell_{k1})^{2}}\,\ell_{i1}\ell_{k1}\right]\\
&\quad=\mathbb{E}_{Z,W,\ell_{i1},\ell_{k1}\geq 0}\left[\frac{4(\sqrt{T}W+\ell_{i1})\ell_{k1}}{\sqrt{T(d-1)Z+(\sqrt{T}W+\ell_{i1}+\ell_{k1})^{2}}+\sqrt{T(d-1)Z+(\sqrt{T}W+\ell_{i1}-\ell_{k1})^{2}}}\,\ell_{i1}\ell_{k1}\right].
\end{align*}

Taking d 𝑑 d\to\infty , we have

\[
\frac{\sqrt{T(d-1)Z+(\sqrt{T}W+\ell_{i1}+\ell_{k1})^{2}}+\sqrt{T(d-1)Z+(\sqrt{T}W+\ell_{i1}-\ell_{k1})^{2}}}{2\sqrt{Td}}\overset{d}{\to}1,
\]

which further implies

\begin{align*}
&\sqrt{Td}\,\frac{4(\sqrt{T}W+\ell_{i1})\ell_{k1}}{\sqrt{T(d-1)Z+(\sqrt{T}W+\ell_{i1}+\ell_{k1})^{2}}+\sqrt{T(d-1)Z+(\sqrt{T}W+\ell_{i1}-\ell_{k1})^{2}}}\,\ell_{i1}\ell_{k1}\\
&\qquad\overset{d}{\to}\sqrt{Td}\,\frac{4(\sqrt{T}W+\ell_{i1})\ell_{k1}}{2\sqrt{Td}}\,\ell_{i1}\ell_{k1}=2(\sqrt{T}W+\ell_{i1})\ell_{i1}\ell_{k1}^{2}
\end{align*}

as d 𝑑 d\to\infty . Therefore,

\begin{align*}
&\lim_{d\to\infty}\mathbb{E}_{Z,W,\ell_{i1},\ell_{k1}\geq 0}\left[\sqrt{Td}\,\frac{4(\sqrt{T}W+\ell_{i1})\ell_{k1}}{\sqrt{T(d-1)Z+(\sqrt{T}W+\ell_{i1}+\ell_{k1})^{2}}+\sqrt{T(d-1)Z+(\sqrt{T}W+\ell_{i1}-\ell_{k1})^{2}}}\,\ell_{i1}\ell_{k1}\right]\\
&\qquad=\mathbb{E}_{Z,W,\ell_{i1},\ell_{k1}\geq 0}\left[2(\sqrt{T}W+\ell_{i1})\ell_{i1}\ell_{k1}^{2}\right]=2\,\mathbb{E}\left[\ell_{i1}^{2}\right]\mathbb{E}_{\ell_{k1}\geq 0}\left[\ell_{k1}^{2}\right],
\end{align*}

which is a positive constant. The last equality uses the fact that $W$, $\ell_{i1}$, and $\ell_{k1}$ are independent and that $\mathbb{E}[W]=0$, so the cross term involving $\sqrt{T}W$ vanishes. Therefore, the output of the single-layer linear self-attention implements online gradient descent with step size $\Theta(R_{\Pi}/\sqrt{Td})$. In the online learning literature, the gradient step size is typically set to $\Theta(R_{\Pi}/\sqrt{Td})$ (Hazan, 2016, Theorem 3.1), which is consistent with the result above. ∎
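For concreteness, the limiting algorithm, online gradient descent with step size $\Theta(R_{\Pi}/\sqrt{Td})$ and linear losses, can be sketched as follows. This is a minimal illustration with Euclidean projection onto an $\ell_{2}$-ball of radius $R_{\Pi}$; the horizon, dimension, seed, and i.i.d. Gaussian losses are our own illustrative choices matching the setup above, not the paper's experiments:

```python
import numpy as np

def ogd_ball(losses, radius, eta):
    """Online gradient descent with linear losses <l_t, x_t> and Euclidean
    projection onto the l2-ball of the given radius; returns the iterates."""
    T, d = losses.shape
    x = np.zeros(d)
    xs = []
    for t in range(T):
        xs.append(x.copy())
        x = x - eta * losses[t]      # gradient of a linear loss is l_t
        nrm = np.linalg.norm(x)
        if nrm > radius:             # projection onto the ball
            x *= radius / nrm
    return np.array(xs)

rng = np.random.default_rng(0)
T, d, R = 200, 8, 1.0                # illustrative choices
losses = rng.standard_normal((T, d))  # i.i.d. standard normal losses
eta = R / np.sqrt(T * d)             # step size Theta(R_Pi / sqrt(Td))
xs = ogd_ball(losses, R, eta)
alg_loss = float(np.einsum("td,td->", losses, xs))
best_loss = float(-R * np.linalg.norm(losses.sum(axis=0)))  # best fixed comparator
regret = alg_loss - best_loss
print(regret)
```

The regret against the best fixed point in hindsight should grow only as $O(\sqrt{Td})$, i.e., sublinearly in $T$.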

D.6 Empirical Validation of Theorem 3 and Theorem 4

We now provide empirical validation for Theorem 3 and Theorem 4, with the training details and results given below.

D.6.1 Empirical Validation of Theorem 3

Our experimental setup is as follows: the horizon $T$ is set to 30 and the dimensionality $d$ to 32, with the loss vectors $\ell_{i}$ drawn coordinate-wise from a standard normal distribution $\mathcal{N}(0,1)$. During training, we ran 40,000 epochs with a batch size of 512, using the Adam optimizer with a learning rate of 0.001. We initialized the value, query, and key bias vectors ($v_{c},q_{c},k_{c}$) as zero vectors.

Our empirical analysis aims to demonstrate that the optimized model inherently emulates online gradient descent. To illustrate this, we focus on two key convergence properties: $K^{\intercal}Q$ approaching the zero matrix $\boldsymbol{O}_{d\times d}$, and $V$ converging to $a\boldsymbol{1}_{d}\boldsymbol{1}^{\intercal}_{d}+bI_{d\times d}$ for some constants $a,b\in\mathbb{R}$. The conditions $K^{\intercal}Q=\boldsymbol{O}_{d\times d}$ and $V=a\boldsymbol{1}_{d}\boldsymbol{1}^{\intercal}_{d}+bI_{d\times d}$ imply that $g(Z_{t};V,Q,K)=\sum_{i=1}^{t}(b-a)\ell_{i}$, which effectively emulates online gradient descent. We repeated the experiment 10 times. To verify $K^{\intercal}Q=\boldsymbol{O}_{d\times d}$, we measure the Frobenius norm ($\|\cdot\|_{F}$) of $K^{\intercal}Q$; to measure the closeness of $V$ to $a\boldsymbol{1}_{d}\boldsymbol{1}^{\intercal}_{d}+bI_{d\times d}$, we measure $\min_{a,b\in\mathbb{R}}\|V-(a\boldsymbol{1}_{d}\boldsymbol{1}^{\intercal}_{d}+bI_{d\times d})\|_{F}/b$. The results are shown in the first plot of Figure 15.
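The second metric is a least-squares projection of $V$ onto $\mathrm{span}\{\boldsymbol{1}_{d}\boldsymbol{1}_{d}^{\intercal},I\}$ and admits a closed form. The sketch below shows how both metrics can be computed; the closed-form coefficients are our own derivation from the $2\times 2$ normal equations, not taken from the paper:

```python
import numpy as np

def attention_metrics(K, Q, V):
    """The two metrics above: ||K^T Q||_F, and the relative residual
    min_{a,b} ||V - (a 11^T + b I)||_F / b via closed-form least squares."""
    d = V.shape[0]
    kq = float(np.linalg.norm(K.T @ Q, "fro"))
    # Normal equations for projecting V onto span{11^T, I}:
    #   a d^2 + b d = sum(V),   a d + b d = trace(V).
    s, tr = float(V.sum()), float(np.trace(V))
    a = (s - tr) / (d * (d - 1))
    b = tr / d - a
    resid = float(np.linalg.norm(V - (a * np.ones((d, d)) + b * np.eye(d)), "fro"))
    return kq, resid / b, a, b

# Sanity check: a matrix exactly of the form a 11^T + b I has zero residual.
d = 32
V = 0.3 * np.ones((d, d)) - 1.2 * np.eye(d)
kq, rel, a, b = attention_metrics(np.zeros((d, d)), np.zeros((d, d)), V)
print(a, b, rel)
```

The fit recovers the planted coefficients ($a=0.3$, $b=-1.2$) with zero residual, so the metric is zero exactly when $V$ has the claimed structure.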

D.6.2 Empirical Validation of Theorem 4

We now focus on two key convergence properties: $K^{\intercal}(Q\boldsymbol{1}_{d}+q_{c})$ approaching the zero vector $\boldsymbol{0}_{d}$, and $V$ converging to $a\boldsymbol{1}_{d}\boldsymbol{1}^{\intercal}_{d}+bI_{d\times d}$ for some constants $a,b\in\mathbb{R}$. The conditions $K^{\intercal}(Q\boldsymbol{1}_{d}+q_{c})=\boldsymbol{0}_{d}$ and $V=a\boldsymbol{1}_{d}\boldsymbol{1}^{\intercal}_{d}+bI_{d\times d}$ imply that $g(Z_{t};V,Q,K)=\sum_{i=1}^{t}(b-a)\ell_{i}$, which effectively emulates online gradient descent. We repeated the experiment 10 times. To verify $K^{\intercal}(Q\boldsymbol{1}_{d}+q_{c})=\boldsymbol{0}_{d}$, we measure the 2-norm of $K^{\intercal}(Q\boldsymbol{1}_{d}+q_{c})$; to measure the closeness of $V$ to $a\boldsymbol{1}_{d}\boldsymbol{1}^{\intercal}_{d}+bI_{d\times d}$, we measure $\min_{a,b\in\mathbb{R}}\|V-(a\boldsymbol{1}_{d}\boldsymbol{1}^{\intercal}_{d}+bI_{d\times d})\|_{F}/b$. The results are shown in the second plot of Figure 15.

Figure 15: Empirical validation of Theorem 3 (top), Theorem 4 (middle), and Conjecture 6 (bottom). The observed convergence for Theorem 3 and Conjecture 6 suggests that the corresponding configurations are not only first-order stationary points but potentially global optimizers.

D.7 Discussions on the Production of FTRL with Entropy Regularization

Now, we consider projecting the output of a single-layer linear self-attention model onto a constrained domain, such as a simplex, which is more amenable to the Experts Problem setting. To this end, we consider the following parameterization, which adds a non-linear structure to the single-layer linear self-attention:

\[
g(Z_{t};V,K,Q,v_{c},k_{c},q_{c})=\texttt{Operator}\left(\sum_{i=1}^{t}(V\ell_{i}+v_{c})\left((K\ell_{i}+k_{c})^{\intercal}(Qc+q_{c})\right)\right),\tag{24}
\]

where Operator denotes the projection onto the convex set.

Conjecture 6.

Assume $\Sigma=I$. Then, the configuration with $K^{\intercal}(Qc+q_{c})=v_{c}=\boldsymbol{0}_{d}$ and $V=\widetilde{\Omega}\left(-\frac{1}{\sqrt{nd}}\right)I_{d\times d}$ is a first-order stationary point of Equation 3 with $N=1$ and $h(x)=x^{2}$, when $\text{LLM}_{\theta}$ is parameterized with Equation 24, $\texttt{Operator}=\texttt{Softmax}$, and $\Pi=\Delta(\mathcal{A})$. This configuration performs FTRL with an entropy regularizer, which is a no-regret algorithm.
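For reference, FTRL with an entropy regularizer over the simplex has the closed-form exponential-weights solution $p_{t}\propto\exp(-\eta\sum_{j<t}\ell_{j})$. Below is a minimal sketch; the step size $\eta=\sqrt{\log d/T}$, the random losses in $[0,1]$, and the seed are our own illustrative choices, not part of the conjecture:

```python
import numpy as np

def ftrl_entropy(losses, eta):
    """FTRL with an entropy regularizer on the simplex; the closed-form
    iterate is exponential weights on the cumulative loss."""
    T, d = losses.shape
    cum = np.zeros(d)
    ps = []
    for t in range(T):
        w = np.exp(-eta * (cum - cum.min()))  # shift-invariant, numerically stable
        ps.append(w / w.sum())
        cum += losses[t]
    return np.array(ps)

rng = np.random.default_rng(0)
T, d = 400, 10
losses = rng.random((T, d))                    # losses in [0, 1]
eta = np.sqrt(np.log(d) / T)                   # illustrative tuning
ps = ftrl_entropy(losses, eta)
alg = float(np.einsum("td,td->", losses, ps))  # expected loss of the algorithm
best = float(losses.sum(axis=0).min())         # best single expert in hindsight
regret = alg - best
print(regret)
```

The regret against the best expert should stay within the standard $O(\sqrt{T\log d})$ bound, illustrating the no-regret property claimed in the conjecture.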

We provide a proof idea for the conjecture, together with its numerical validation. Moreover, as observed in Figure 15, the configurations suggested by Theorem 3 and Conjecture 6 might also be global optimizers, since training converged to the configurations that Theorem 3 and Conjecture 6 suggest.

To be specific, we will consider

\[
f(V,a,\beta,v_{c})=\mathbb{E}\left(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(e_{s}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}{\sum_{y=1}^{d}\exp\left(e_{y}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}-\min_{s}\sum_{t=1}^{T}\ell_{ts}\right)^{2}
\]

and will try to prove that $a=\boldsymbol{0}_{d}$, $v_{c}=v\boldsymbol{1}_{d}$, $V=kI$ is a first-order stationary point.

Step 1. Calculating $\frac{\partial f}{\partial v_{c}}$.

We use the following formula: for $x\in[d]$ and $t\geq 2$, we have

\begin{align*}
&\frac{\partial}{\partial v_{cx}}\exp\left(e_{y}^{\intercal}\sum_{i=1}^{t}(V\ell_{i}\ell_{i}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{i}+v_{c}\beta)\right)\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=\exp\left(e_{y}^{\intercal}\sum_{i=1}^{t}(V\ell_{i}\ell_{i}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{i}+v_{c}\beta)\right)\frac{\partial}{\partial v_{cx}}\left(e_{y}^{\intercal}\sum_{i=1}^{t}(V\ell_{i}\ell_{i}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{i}+v_{c}\beta)\right)\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=\exp\left(e_{y}^{\intercal}\sum_{i=1}^{t}(V\ell_{i}\ell_{i}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{i}+v_{c}\beta)\right)\sum_{i=1}^{t}\left(a^{\intercal}\ell_{i}\ell_{i}^{\intercal}e_{x}+\beta\right)\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=t\beta\exp(v\beta)\exp\Big(\beta k\sum_{i=1}^{t}\ell_{iy}\Big),
\end{align*}

and for $t=1$, $\frac{\partial}{\partial v_{cx}}\exp\left(e_{y}^{\intercal}\sum_{i=1}^{t}(V\ell_{i}\ell_{i}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{i}+v_{c}\beta)\right)\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}=0$, so the same formula as for $t\geq 2$ applies. Thus, we have

\begin{align*}
&\frac{\partial}{\partial v_{cx}}\left(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(e_{s}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}{\sum_{y=1}^{d}\exp\left(e_{y}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}-\min_{s}\sum_{t=1}^{T}\ell_{ts}\right)\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=\beta\exp(v\beta)\sum_{t=1}^{T}t\sum_{s=1}^{d}\ell_{ts}\frac{\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{js}\right)-\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{js}\right)\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)}{\left(\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)\right)^{2}}\\
&=0.
\end{align*}

Therefore,

\begin{align*}
&\frac{\partial f(V,a,\beta,v_{c})}{\partial v_{cx}}\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=\mathbb{E}\Bigg[\left(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(e_{s}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}{\sum_{y=1}^{d}\exp\left(e_{y}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}-\min_{s}\sum_{t=1}^{T}\ell_{ts}\right)\\
&\qquad\cdot\frac{\partial}{\partial v_{cx}}\left(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(e_{s}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}{\sum_{y=1}^{d}\exp\left(e_{y}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}-\min_{s}\sum_{t=1}^{T}\ell_{ts}\right)\Bigg]\Bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=0.
\end{align*}

Step 2. Calculating $\frac{\partial f}{\partial V}$.

The following formula will be used for calculating $\frac{\partial f}{\partial V}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$: for $r,c\in[d]$, we have

\begin{align*}
&\frac{\partial}{\partial V_{rc}}\exp\left(e_{y}^{\intercal}\sum_{i=1}^{t}(V\ell_{i}\ell_{i}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{i}+v_{c}\beta)\right)\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=\exp\left(e_{y}^{\intercal}\sum_{i=1}^{t}(V\ell_{i}\ell_{i}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{i}+v_{c}\beta)\right)\frac{\partial}{\partial V_{rc}}\left(e_{y}^{\intercal}\sum_{i=1}^{t}(V\ell_{i}\ell_{i}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{i}+v_{c}\beta)\right)\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=\exp\left(\sum_{i=1}^{t}k\beta\ell_{iy}+v\beta\right)\sum_{i=1}^{t}\beta\boldsymbol{1}(y=r)\ell_{ic}.
\end{align*}

Therefore,

\begin{align*}
&\frac{\partial f(V,a,\beta,v_{c})}{\partial V_{rc}}\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=\mathbb{E}\Bigg[\left(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(e_{s}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}{\sum_{y=1}^{d}\exp\left(e_{y}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}-\min_{s}\sum_{t=1}^{T}\ell_{ts}\right)\\
&\qquad\cdot\frac{\partial}{\partial V_{rc}}\left(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(e_{s}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}{\sum_{y=1}^{d}\exp\left(e_{y}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}-\min_{s}\sum_{t=1}^{T}\ell_{ts}\right)\Bigg]\Bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=\mathbb{E}\Bigg[\left(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{js}+v\beta\right)}{\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}+v\beta\right)}-\min_{s}\sum_{t=1}^{T}\ell_{ts}\right)\\
&\qquad\cdot\Biggl(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\sum_{j=1}^{t-1}\beta\boldsymbol{1}(s=r)\ell_{jc}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{js}+v\beta\right)\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}+v\beta\right)}{\left(\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}+v\beta\right)\right)^{2}}\\
&\qquad\qquad-\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{js}+v\beta\right)\sum_{y=1}^{d}\left(\sum_{j=1}^{t-1}\beta\boldsymbol{1}(y=r)\ell_{jc}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}+v\beta\right)\right)}{\left(\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}+v\beta\right)\right)^{2}}\Biggr)\Bigg]\\
&=\beta\,\mathbb{E}\Bigg[\left(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{js}\right)}{\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)}-\min_{s}\sum_{t=1}^{T}\ell_{ts}\right)\\
&\qquad\cdot\Biggl(\underbrace{\frac{\sum_{t=1}^{T}\sum_{j=1}^{t-1}\sum_{y=1}^{d}\ell_{tr}\ell_{jc}\exp\left(\beta k\sum_{j^{\prime}=1}^{t-1}\ell_{j^{\prime}r}\right)\exp\left(\beta k\sum_{j^{\prime}=1}^{t-1}\ell_{j^{\prime}y}\right)}{\left(\sum_{y^{\prime}=1}^{d}\exp\left(\beta k\sum_{j^{\prime}=1}^{t-1}\ell_{j^{\prime}y^{\prime}}\right)\right)^{2}}}_{(i)}-\underbrace{\frac{\sum_{t=1}^{T}\sum_{j=1}^{t-1}\sum_{y=1}^{d}\ell_{ty}\ell_{jc}\exp\left(\beta k\sum_{j^{\prime}=1}^{t-1}\ell_{j^{\prime}r}\right)\exp\left(\beta k\sum_{j^{\prime}=1}^{t-1}\ell_{j^{\prime}y}\right)}{\left(\sum_{y^{\prime}=1}^{d}\exp\left(\beta k\sum_{j^{\prime}=1}^{t-1}\ell_{j^{\prime}y^{\prime}}\right)\right)^{2}}}_{(ii)}\Biggr)\Bigg].
\end{align*}

We can observe the following: 1) if $r_{1}\neq c_{1}$ and $r_{2}\neq c_{2}$, then $\frac{\partial f}{\partial V_{r_{1}c_{1}}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}=\frac{\partial f}{\partial V_{r_{2}c_{2}}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$; and 2) $\frac{\partial f}{\partial V_{r_{1}r_{1}}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}=\frac{\partial f}{\partial V_{r_{2}r_{2}}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$.

Step 3. Calculating $\frac{\partial f}{\partial\beta}$.

The following formula will be used for calculating $\frac{\partial f}{\partial\beta}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$:

\begin{align*}
\frac{\partial}{\partial\beta}&\exp\left(e_{y}^{\intercal}\sum_{i=1}^{t}(V\ell_{i}\ell_{i}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{i}+v_{c}\beta)\right)\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=\exp\left(e_{y}^{\intercal}\sum_{i=1}^{t}(V\ell_{i}\ell_{i}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{i}+v_{c}\beta)\right)\frac{\partial}{\partial\beta}\left(e_{y}^{\intercal}\sum_{i=1}^{t}(V\ell_{i}\ell_{i}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{i}+v_{c}\beta)\right)\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=tv\beta\exp\left(\sum_{i=1}^{t}k\beta\ell_{iy}+v\beta\right).
\end{align*}

Further, we have

\begin{align*}
&\frac{\partial}{\partial\beta}\left(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(e_{s}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}{\sum_{y=1}^{d}\exp\left(e_{y}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}-\min_{s}\sum_{t=1}^{T}\ell_{ts}\right)\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=v\beta\exp(v\beta)\sum_{t=1}^{T}t\sum_{s=1}^{d}\ell_{ts}\frac{\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{js}\right)-\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{js}\right)\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)}{\left(\sum_{y=1}^{d}\exp\left(e_{y}^{\intercal}\sum_{j=1}^{t-1}\beta V\ell_{j}\right)\right)^{2}}\\
&=0.
\end{align*}

Step 4. Calculating $\frac{\partial f}{\partial a}$.

Note that

\begin{align*}
\frac{\partial}{\partial a_{x}}&\exp\left(e_{y}^{\intercal}\sum_{i=1}^{t}(V\ell_{i}\ell_{i}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{i}+v_{c}\beta)\right)\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=\exp\left(e_{y}^{\intercal}\sum_{i=1}^{t}(V\ell_{i}\ell_{i}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{i}+v_{c}\beta)\right)\frac{\partial}{\partial a_{x}}\left(e_{y}^{\intercal}\sum_{i=1}^{t}(V\ell_{i}\ell_{i}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{i}+v_{c}\beta)\right)\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=\exp\left(e_{y}^{\intercal}\sum_{i=1}^{t}(V\ell_{i}\ell_{i}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{i}+v_{c}\beta)\right)\sum_{i=1}^{t}\left(e_{y}^{\intercal}V\ell_{i}\ell_{i}^{\intercal}e_{x}+e_{y}^{\intercal}v_{c}\ell_{i}^{\intercal}e_{x}\right)\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=\exp\left(\sum_{i=1}^{t}\beta k\ell_{iy}+v\beta\right)\sum_{i=1}^{t}(k\ell_{iy}\ell_{ix}+v\ell_{ix}).
\end{align*}

Therefore,

\begin{align*}
&\frac{\partial f(V,a,\beta,v_{c})}{\partial a_{x}}\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=\mathbb{E}\Bigg[\left(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(e_{s}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}{\sum_{y=1}^{d}\exp\left(e_{y}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}-\min_{s}\sum_{t=1}^{T}\ell_{ts}\right)\\
&\qquad\frac{\partial}{\partial a_{x}}\left(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(e_{s}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}{\sum_{y=1}^{d}\exp\left(e_{y}^{\intercal}\sum_{j=1}^{t-1}(V\ell_{j}\ell_{j}^{\intercal}a+(\beta V+v_{c}a^{\intercal})\ell_{j}+v_{c}\beta)\right)}-\min_{s}\sum_{t=1}^{T}\ell_{ts}\right)\Bigg]\bigg|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}\\
&=\mathbb{E}\Bigg[\left(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{js}\right)}{\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)}-\min_{s}\sum_{t=1}^{T}\ell_{ts}\right)\\
&\qquad\Biggl(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\sum_{j=1}^{t-1}(k\ell_{js}\ell_{jx}+v\ell_{jx})\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{js}\right)\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)}{\left(\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)\right)^{2}}\\
&\qquad\qquad-\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{js}\right)\sum_{y=1}^{d}\left(\sum_{j=1}^{t-1}(k\ell_{jy}\ell_{jx}+v\ell_{jx})\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)\right)}{\left(\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)\right)^{2}}\Biggr)\Bigg]\\
&=\mathbb{E}\Bigg[k\left(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{js}\right)}{\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)}-\min_{s}\sum_{t=1}^{T}\ell_{ts}\right)\\
&\qquad\Biggl(\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\sum_{j=1}^{t-1}\ell_{js}\ell_{jx}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{js}\right)\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)}{\left(\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)\right)^{2}}\\
&\qquad\qquad-\sum_{t=1}^{T}\sum_{s=1}^{d}\ell_{ts}\frac{\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{js}\right)\sum_{y=1}^{d}\left(\sum_{j=1}^{t-1}\ell_{jy}\ell_{jx}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)\right)}{\left(\sum_{y=1}^{d}\exp\left(\sum_{j=1}^{t-1}\beta k\ell_{jy}\right)\right)^{2}}\Biggr)\Bigg].
\end{align*}

Note that the value does not depend on $x$, which means that $\frac{\partial f}{\partial a}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}=\widetilde{c}\boldsymbol{1}_{d}$ for some constant $\widetilde{c}$.

D.7.1 Numerical Analysis of Step 2 and Step 4

In Steps 2 and 4 above, we were not able to show the existence of a $k$ at which these derivatives become zero. We hence provide some empirical evidence here. First, we plot the estimated $\frac{\partial f}{\partial V_{rc}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$ (for $r\neq c$), $\frac{\partial f}{\partial V_{rr}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$, and $\frac{\partial f}{\partial a_{x}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$ as functions of $k$, when $\ell_{ts}\sim\text{Unif}([0,1])$ for all $t\in[T],s\in[d]$.
While the graph of $\frac{\partial f}{\partial V}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$ is not stable, Figure 16 shows that the values of $k$ at which $\frac{\partial f}{\partial V_{rc}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$, $\frac{\partial f}{\partial V_{rr}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$, and $\frac{\partial f}{\partial a_{x}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$ vanish are very close to each other. We used Monte Carlo estimation with $1{,}000{,}000$ samples.
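This kind of Monte Carlo estimate can be reproduced in outline as follows. The sketch below is illustrative rather than the authors' exact script: it assumes a surrogate objective $f=\mathbb{E}[R^{2}]/2$, where $R$ is the regret expression from Step 4 (consistent with the product form $\mathbb{E}[R\cdot\partial R/\partial a_{x}]$ above), and estimates $\partial f/\partial a_{x}$ at $a=\boldsymbol{0}_{d}$, $v_{c}=v\boldsymbol{1}_{d}$, $V=kI$ by a central finite difference; the function names and default parameters are assumptions.

```python
import numpy as np

def regret(ell, V, a, beta, v_c):
    """Regret of the softmax policy induced by (V, a, beta, v_c) on losses ell (T x d)."""
    T, d = ell.shape
    val = -ell.sum(axis=0).min()           # minus the best fixed action in hindsight
    for t in range(T):
        logits = np.zeros(d)
        for j in range(t):                  # sum_{j=1}^{t-1} of the pre-softmax term
            l = ell[j]
            logits += (V @ np.outer(l, l) @ a
                       + (beta * V + np.outer(v_c, a)) @ l
                       + v_c * beta)
        p = np.exp(logits - logits.max())   # numerically stable softmax
        p /= p.sum()
        val += ell[t] @ p                   # expected loss of the softmax policy at step t
    return val

def df_dax(k, x=0, beta=1.0, v=1.0, T=5, d=4, eps=1e-3, n=2000, seed=0):
    """Monte Carlo central finite difference of f = E[R^2]/2 w.r.t. a_x."""
    rng = np.random.default_rng(seed)
    V, v_c = k * np.eye(d), v * np.ones(d)
    acc = 0.0
    for _ in range(n):
        ell = rng.uniform(size=(T, d))      # ell_{ts} ~ Unif([0, 1])
        a_p = np.zeros(d); a_p[x] = eps
        a_m = np.zeros(d); a_m[x] = -eps
        rp = regret(ell, V, a_p, beta, v_c)
        rm = regret(ell, V, a_m, beta, v_c)
        acc += (rp**2 - rm**2) / (2 * (2 * eps))   # d(R^2/2)/da_x
    return acc / n
```

Scanning `k` over a grid then traces an approximation of the black curve in Figure 16, and the $x$-independence claimed in Step 4 can be checked by comparing `df_dax(k, x=0)` against `df_dax(k, x=1)`.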

Figure 16: Estimates of $20\frac{\partial f}{\partial V_{rc}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$ ($r\neq c$) (red), $20\frac{\partial f}{\partial V_{rr}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$ (blue), and $\frac{\partial f}{\partial a_{x}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$ (black). We experimented with $n\in[4,9]$ and $d\in[4,9]$. The figure might indicate that the values of $\beta k$ that make $\frac{\partial f}{\partial V_{rc}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$ ($r\neq c$), $\frac{\partial f}{\partial V_{rr}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$, and $\frac{\partial f}{\partial a_{x}}\big|_{a=\boldsymbol{0}_{d},v_{c}=v\boldsymbol{1}_{d},V=kI}$ zero coincide.

D.7.2 Empirical Validation

Our model architecture is defined as follows: the number of layers $T$ is set to 30 and the dimensionality $d$ to 32, with the loss vectors $l_{i}$ drawn from the distribution $Z$, a standard normal distribution $\mathcal{N}(0,1)$. During training, we ran 40,000 epochs with a batch size of 512, using the Adam optimizer with a learning rate of 0.001. We focus on two key convergence properties: $K^{\intercal}(Q\boldsymbol{1}+q_{c})$ approaching the zero vector $\boldsymbol{0}_{d}$, and $V$ converging to $a\boldsymbol{1}_{d}\boldsymbol{1}^{\intercal}_{d}+bI_{d\times d}$, where $a$ and $b$ are constants in $\mathbb{R}$. The conditions $K^{\intercal}(Q\boldsymbol{1}+q_{c})=\boldsymbol{0}_{d}$ and $V=a\boldsymbol{1}_{d}\boldsymbol{1}^{\intercal}_{d}+bI_{d\times d}$ imply that $g(Z_{t};V,Q,K)=\sum_{i=1}^{t}(b-a)l_{i}$, effectively emulating the process of an online gradient descent method. We repeated the experiment 10 times. To verify $K^{\intercal}(Q\boldsymbol{1}+q_{c})=\boldsymbol{0}_{d}$, we measure the 2-norm of $K^{\intercal}(Q\boldsymbol{1}+q_{c})$; to measure the closeness of $V$ to $a\boldsymbol{1}_{d}\boldsymbol{1}^{\intercal}_{d}+bI_{d\times d}$, we measure $\min_{a,b\in\mathbb{R}}\|V-(a\boldsymbol{1}_{d}\boldsymbol{1}^{\intercal}_{d}+bI_{d\times d})\|_{2,2}/b$. The results are shown in the third plot of Figure 15.
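The two convergence metrics above can be computed directly from trained weights. The sketch below is a minimal illustration with assumed matrix shapes (not the authors' training code); it evaluates $\|K^{\intercal}(Q\boldsymbol{1}+q_{c})\|_{2}$ and solves the inner minimization over $(a,b)$ by least squares on the two-dimensional basis $\{\boldsymbol{1}_{d}\boldsymbol{1}^{\intercal}_{d}, I\}$, using the Frobenius norm as a stand-in for $\|\cdot\|_{2,2}$.

```python
import numpy as np

def metric_kq(K, Q, q_c):
    """2-norm of K^T (Q 1 + q_c); near zero indicates the first convergence property."""
    d = Q.shape[1]
    return np.linalg.norm(K.T @ (Q @ np.ones(d) + q_c))

def metric_v(V):
    """Relative distance of V to span{1 1^T, I}: min_{a,b} ||V - (a J + b I)||_F / |b|."""
    d = V.shape[0]
    J, I = np.ones((d, d)), np.eye(d)
    # Least squares over the 2-dimensional subspace spanned by J and I
    A = np.stack([J.ravel(), I.ravel()], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, V.ravel(), rcond=None)
    resid = np.linalg.norm(V - (a * J + b * I))   # Frobenius norm of the residual
    return resid / abs(b)
```

By construction, `metric_v` returns (near) zero exactly when $V$ lies in the span of $\boldsymbol{1}_{d}\boldsymbol{1}^{\intercal}_{d}$ and $I$, and `metric_kq` returns zero when $q_{c}=-Q\boldsymbol{1}$ or $K^{\intercal}$ annihilates the residual vector.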

D.8 Training Details of Section 5.4

We provide the training details of Section 5.4. For the multi-layer Transformer, we used a 4-layer, 1-head architecture. For both the single-layer and multi-layer models, we employed the Adam optimizer with a learning rate of 0.001, and trained for 2,000 epochs with a batch size of 512. When training on the loss sequences with a predictable trend, we likewise used a 4-layer, 1-head Transformer and the Adam optimizer with a learning rate of 0.001, but trained for 9,000 epochs with a batch size of 512.

D.9 Ablation Study on Training Equation 3

In this section, we provide an ablation study that varies $N$ and $k$ in Equation 3. Specifically, we set $N=1,2,4$, $f(x,k)=\max(x,0)^{k}$ with $k=1,2$, and $h(x)=\max(x,0)^{2}$. For the multi-layer Transformer training, we used a 4-layer, 1-head Transformer. For both single-layer and multi-layer models, we employed the Adam optimizer with a learning rate of 0.001, and trained for 2,000 epochs with a batch size of 512. We experimented on randomly generated loss sequences: the uniform loss sequence ($\ell_{t}\sim\text{Unif}([0,10]^{2})$), with the results in Figure 17 and Figure 18; and the Gaussian loss sequence ($\ell_{t}\sim\mathcal{N}(5\cdot\boldsymbol{1}_{2},I)$), with the results in Figure 19 and Figure 20.
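For concreteness, the two ablated shaping functions can be transcribed directly; this is a minimal sketch of $f$ and $h$ only, with the aggregation over the $N$ sampled sequences in Equation 3 omitted.

```python
import numpy as np

def f(x, k):
    # f(x, k) = max(x, 0)^k, the ablated function with k = 1, 2
    return np.maximum(x, 0.0) ** k

def h(x):
    # h(x) = max(x, 0)^2, penalizing only positive regret
    return np.maximum(x, 0.0) ** 2
```

Both functions vanish on non-positive regret; increasing $k$ in $f$ penalizes large positive regret more aggressively.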

Figure 17: Ablation study for the uniform loss sequence, trained with a single self-attention layer and Softmax projection.
Figure 18: Ablation study for the uniform loss sequence, trained with multiple self-attention layers and Softmax projection.
Figure 19: Ablation study for the Gaussian loss sequence, trained with a single self-attention layer and Softmax projection.
Figure 20: Ablation study for the Gaussian loss sequence, trained with multiple self-attention layers and Softmax projection.